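Kubernetes Study Notes

1 - Cloud Native Overview

1.1 - k8s Knowledge Map

1. k8s Knowledge Map

The mind map below organizes the topics that make up the k8s ecosystem.

Mind map: https://www.processon.com/view/link/5d7f7d08e4b03461a3a937e2

2. Key k8s Open Source Projects

TODO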

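1.2 - 12 Factor

This section introduces the methodology used in PaaS platform architecture design, collectively known as the 12-Factor app.

Introduction

Software is commonly delivered as a service, that is, software-as-a-service (SaaS). The 12-Factor principles offer a methodology for building SaaS apps that:

  • Use standardized, automated setup to minimize the learning cost for developers.
  • Are decoupled from the underlying operating system, offering maximum portability between systems.
  • Are suitable for deployment on modern cloud platforms, saving resources on servers and systems administration.
  • Minimize divergence between development and production, and use continuous delivery for agile development.
  • Can scale up without significant changes to tooling, architecture, or development practices.

The methodology applies to apps written in any language and using any combination of backing services (databases, message queues, caches, and so on).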

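1. Codebase

One codebase, many deploys

App code is managed in a version control system such as Git or SVN. A copy of the database that tracks all revisions of the code is called a code repository.

1.1. One codebase

There is always a one-to-one correlation between the codebase and the app:

  • If there are multiple codebases, it is not an app but a distributed system. Each component of a distributed system is an app, and each can be developed according to the 12-Factor principles.
  • Multiple apps sharing one codebase violates the 12-Factor principles. The solution is to factor the shared code into independent libraries and consume them through dependency management.

1.2. Many deploys

Each app has only one codebase, but there can be many deploys at the same time; each deploy is a running instance of the app.

Deploys differ in the following ways:

  • Different configuration can be used for different environments, such as development, staging, and production.
  • Different versions can be active. For example, development may contain changes not yet synced to staging, and staging may in turn be ahead of production.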

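2. Dependencies

Explicitly declare dependencies

Most programming languages provide a packaging system or tool for distributing dependency libraries. In Golang, for example, the vendor directory holds all of an app's dependency packages.

A 12-Factor app declares all dependencies completely and exactly through a dependency manifest. At runtime, a dependency isolation tool ensures that the app does not call dependencies that exist on the system but are not declared in the manifest.

One benefit of explicit declaration is that it simplifies environment setup: developers focus on the app's codebase, while a dependency manager fetches and configures the libraries, for example dep in Golang (since superseded by Go modules).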

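3. Config

Store config in the environment

An app's config usually varies substantially between deploys (for example development, staging, and production). It includes:

  • Settings for backing services such as databases and Redis
  • Per-deploy values, such as the domain name
  • Credentials for third-party services

The 12-Factor principles require strict separation of config from code; config must not be written into the code as constants. Config varies substantially across deploys, while the code is identical.

A test for whether an app has all config correctly factored out of the code is whether the codebase could be made open source at any moment without exposing sensitive information.

The 12-Factor principles recommend storing config in environment variables, which are easy to change between deploys without touching the code. (k8s, for example, supports injecting configuration into containers through environment variables.)

In a 12-Factor app, environment variables are granular and independent of one another, so the app can grow into more deploys smoothly.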
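As a minimal sketch of reading config from the environment, the script below falls back to development defaults when a variable is unset. The variable names (APP_DB_HOST, APP_DB_PORT, APP_CACHE_URL) and the default values are hypothetical and chosen only for illustration.

```shell
# Read each setting from the environment; the APP_* names are hypothetical.
load_config() {
    DB_HOST="${APP_DB_HOST:-127.0.0.1}"
    DB_PORT="${APP_DB_PORT:-3306}"
    CACHE_URL="${APP_CACHE_URL:-redis://127.0.0.1:6379}"
}

# Print the effective config on one line.
print_config() {
    load_config
    printf 'db=%s:%s cache=%s\n' "$DB_HOST" "$DB_PORT" "$CACHE_URL"
}

print_config
```

The same script can then run unchanged in every deploy; only the environment differs, for example `APP_DB_HOST=db.prod.internal ./app.sh`.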

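4. Backing services

Treat backing services as attached resources

A backing service is any service the app consumes over the network as part of its normal operation, for example databases (MySQL, CouchDB), messaging/queueing systems (RabbitMQ, Beanstalkd), SMTP services for outbound email (Postfix), and caching systems (Memcached).

They can be divided by who manages them into local services (such as a local database) and third-party services (such as Amazon S3). To a 12-Factor app both are attached resources with no distinction between them. When a backing service fails, the app can switch to a previously backed-up service without code changes (though config may need to change). A 12-Factor app stays loosely coupled to its backing services.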

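5. Build, release, run

Strictly separate build and run stages

A codebase is transformed into a deploy through three stages:

  • Build stage: converts the code into an executable bundle. The build uses a specific version of the code, fetches dependencies, and compiles binaries and assets.
  • Release stage: combines the build with the config of the current deploy so that it is ready to use in the execution environment.
  • Run stage (runtime): launches a set of the app's processes in the execution environment against a selected release.

A 12-Factor app strictly separates the build, release, and run stages. Every release has a unique release ID, such as a timestamp or an incrementing version number.

Any change must create a new release; a rollback means returning to a specific earlier release.

Before new code is deployed, developers trigger the build. The build stage can be relatively complex, because errors surface in front of the developer driving it and can be handled properly. The run stage can be triggered manually or run automatically, and should have as few moving parts as possible.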
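A release ID can be as simple as a build version joined with a UTC timestamp. The helper below is a sketch only; the function name and the ID layout are assumptions, not a format prescribed by the 12-Factor text.

```shell
# Build a release ID such as v1.4.0-20240101120000.
# $1: build version (for example a git tag). The UTC timestamp makes
# IDs for the same version sort in chronological order.
release_id() {
    printf '%s-%s\n' "$1" "$(date -u +%Y%m%d%H%M%S)"
}

release_id v1.4.0
```

Because every call yields a new ID, a changed config produces a new release even when the build is unchanged, which keeps rollbacks unambiguous.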

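6. Processes

Execute the app as one or more stateless processes

The processes of a 12-Factor app must be stateless and share nothing. Any data that needs to persist is stored in a backing service, such as a database.

Memory and disk space can be used as a cache by a process. A 12-Factor app does not rely on this cache persisting and accepts that it may be lost, for example on restart.

Binaries should be compiled during the build stage, not the run stage.

Some apps use sticky sessions, that is, they cache user session data in the memory of a process and route later requests from the same user to the same process. The 12-Factor principles reject this approach and recommend storing session data in a cache with expiry, such as Redis or Memcached.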

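7. Port binding

Export services via port binding

The app exports its service by binding to a port and listening for requests arriving on that port. Port binding also means that one app can become the backing service of another app, for example by serving API requests.

8. Concurrency

Scale out via the process model

In a 12-Factor app, developers can assign different kinds of work to different process types. For example, HTTP requests are handled by web processes, and long-running background work by worker processes (k8s itself often uses different kinds of managers to handle different tasks).

Because the processes share nothing and can be partitioned horizontally, scaling out is straightforward.

12-Factor app processes should not daemonize or write PID files. Instead they rely on a process manager (such as systemd) to manage output streams, respond to crashed processes, and handle user-initiated restarts and shutdowns.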

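9. Disposability

Maximize robustness with fast startup and graceful shutdown

The processes of a 12-Factor app are disposable: they can be started or stopped quickly, which helps rapid deployment, iteration, and elastic scaling.

Processes should strive to minimize startup time. This makes releases more agile and improves robustness, because a new instance can be brought up on another machine quickly when something goes wrong.

Processes shut down gracefully when they receive a termination signal (SIGTERM). Graceful shutdown means the process stops listening on the service port, refuses new requests, finishes the requests it has already accepted, and then exits.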
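The graceful-shutdown behavior can be sketched with a shell trap: the handler only sets a flag, and the worker loop checks the flag between jobs so that the job in progress always completes. The function names and the job loop are illustrative assumptions, not part of any particular framework.

```shell
stop_requested=0

# Signal handler: record the request and let the loop finish its current job.
on_term() { stop_requested=1; }
trap on_term TERM

# $1: maximum number of jobs (keeps the demo finite).
run_worker() {
    i=0
    while [ "$i" -lt "$1" ] && [ "$stop_requested" -eq 0 ]; do
        i=$((i + 1))
        echo "processing job $i"
    done
    echo "shutting down cleanly"
}

run_worker 3
```

Doing the real work outside the handler keeps the handler trivial and avoids interrupting a job halfway through.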

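Processes should also be robust against sudden death. For example, a task queue can recover work that a crashed process did not finish, so tasks should be designed to be idempotent: they can be run again, and running them again gives the same result.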

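10. Dev/prod parity

Keep development, staging, and production as similar as possible

Environments can differ in the following ways:

  • Time gap: the cycle from development to deployment is long.
  • Personnel gap: developers only write code and ops engineers only deploy it, so the roles are too isolated.
  • Tools gap: the configuration, runtime environment, and backing service types may differ between environments.

Keep the gap between local and production small: shorten the release cycle, integrate development and operations, and keep the development environment consistent with production (for example, by using Docker containers).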

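11. Logs

Treat logs as event streams

Logs are the aggregated stream of events. A 12-Factor app does not concern itself with storing its output stream and does not write or manage log files. Instead, it writes its event stream to standard output (stdout).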
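A minimal sketch of stdout logging follows: one event per line, with no file handling inside the app. The line format (timestamp, level, msg) is an assumption chosen for illustration.

```shell
# Write one log event per line to stdout.
# $1: level; remaining arguments: the message.
log() {
    level="$1"; shift
    printf '%s level=%s msg="%s"\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$level" "$*"
}

log info "server started"
```

Where the stream ends up (a terminal, a file, or a collector) is decided by whatever launches the process, not by the app.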

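The stdout stream can be captured by another component, merged with other log streams, and sent to a central log service for viewing or archival. Fluentd is one open source log collector.

A captured stream can be written to a file or watched in a terminal in real time. Most importantly, it can be sent to a log indexing and analysis system such as Splunk for later analysis, statistics, monitoring, and alerting. For example:

  • Finding specific events in the past.
  • Graphing large-scale trends, such as requests per minute.
  • Alerting on user-defined conditions, such as errors per minute exceeding a threshold.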

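12. Admin processes

Run admin/management tasks as one-off processes

Developers often need to run one-off administrative or maintenance tasks for the app. One-off admin processes should run in the same environment as the app's long-running processes; developers can execute one-off scripts or tasks over ssh.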

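2 - Installation and Configuration

2.1 - Deploying a k8s cluster

2.1.1 - Installing production Kubernetes with kubeadm

This article describes how to build a production-grade, highly available k8s cluster with kubeadm.

1. Environment preparation

1.0. Master hardware sizing

The Master nodes of a Kubernetes cluster run core components such as etcd, kube-apiserver, and kube-controller-manager, which are critical to cluster stability, so Master specifications for production clusters must be chosen carefully. The required specification depends on cluster size: the larger the cluster, the larger the Master.

Note: cluster size can be measured in several ways, such as node count, Pod count, deployment frequency, and traffic. Here, cluster size simply means the number of nodes in the cluster.

For common cluster sizes, Master nodes can be sized as follows (test environments can use smaller specifications; the choices below aim to keep Master load low).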

Cluster size     Master spec                                       Disk
1~5 nodes        4 cores, 8 GB (2 cores, 4 GB not recommended)
6~20 nodes       4 cores, 16 GB
21~100 nodes     8 cores, 32 GB
100~200 nodes    16 cores, 64 GB
1000 nodes       32 cores, 128 GB                                  1 TB SSD

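Notes:

  • Because Etcd performance is bound by disk speed, store Etcd data on SSDs where possible.

  • For disaster recovery across data centers, the three masters can be spread over three data centers within one availability zone (with network latency between data centers of 10 ms or less).

  • Request a load balancer (LB) in front of the master nodes for high availability; the LB address serves as the apiserver access address.

1.1. Configure firewall port rules

In production, set iptables port access rules on the k8s nodes.

1.1.1. Master node ports

Protocol  Direction  Port range  Purpose                  Used by
TCP       Inbound    6443        Kubernetes API server    All
TCP       Inbound    2379-2380   etcd server client API   kube-apiserver, etcd
TCP       Inbound    10250       Kubelet API              Self, control plane
TCP       Inbound    10259       kube-scheduler           Self
TCP       Inbound    10257       kube-controller-manager  Self

1.1.2. Worker node ports

Protocol  Direction  Port range   Purpose            Used by
TCP       Inbound    10250        Kubelet API        Self, control plane
TCP       Inbound    30000-32767  NodePort Services  All

Add the iptables rules

Open ports 6443, 2379, 2380, and 10250 on the master nodes.

iptables -A INPUT -p tcp -m multiport --dports 6443,2379,2380,10250 -j ACCEPT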
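The two port tables can be turned into rules with a small helper. The sketch below only prints the commands (a dry run) so they can be reviewed before being run as root; the helper name and the variable names are hypothetical. Note that the multiport match writes ranges with a colon.

```shell
# Print (do not execute) an ACCEPT rule for a comma-separated port list.
# $1: ports or ranges; multiport uses "first:last" for a range.
port_rule() {
    printf 'iptables -A INPUT -p tcp -m multiport --dports %s -j ACCEPT\n' "$1"
}

# Port lists taken from the master and worker tables above.
MASTER_PORTS="6443,2379:2380,10250,10257,10259"
WORKER_PORTS="10250,30000:32767"

port_rule "$MASTER_PORTS"
port_rule "$WORKER_PORTS"
```

Piping the output to `sh` on the matching node type applies the rules; restricting the source addresses per row of the tables is left out of this sketch.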

1.2. Disable swap

[root@master ~]#swapoff -a
[root@master ~]#
[root@master ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:            976         366         135           6         474         393
Swap:             0           0           0

# The Swap row shows 0, which means swap is disabled
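swapoff -a only lasts until the next reboot. To keep swap off permanently, the swap entries in /etc/fstab are usually commented out as well. The helper below is a sketch of that step; the function name is hypothetical, and it takes the file path as an argument so it can be tried on a copy first.

```shell
# Comment out active swap entries in an fstab-style file.
# $1: path to the file. A backup is kept next to it as <file>.bak.
disable_swap_in_fstab() {
    sed -i.bak -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "$1"
}

# Usage (as root), after reviewing the file:
# disable_swap_in_fstab /etc/fstab
```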

1.3. Enable br_netfilter and bridge-nf-call-iptables

Reference: https://imroc.cc/post/202105/why-enable-bridge-nf-call-iptables/

# Load the overlay and br_netfilter modules at boot
cat <<EOF | sudo tee /etc/modules-load.d/k8s.conf
overlay
br_netfilter
EOF

sudo modprobe overlay
sudo modprobe br_netfilter

# Enable bridge-nf-call-iptables: set the required sysctl parameters, which persist across reboots
cat <<EOF | sudo tee /etc/sysctl.d/k8s.conf
net.bridge.bridge-nf-call-iptables  = 1
net.bridge.bridge-nf-call-ip6tables = 1
net.ipv4.ip_forward                 = 1
EOF

# Apply the sysctl parameters without rebooting
sudo sysctl --system

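2. Install a container runtime

Install a container runtime on every host; containerd is the recommended runtime. The install commands for containerd and Docker follow.

2.1. Containerd

1. Reference: installing containerd

# for ubuntu
apt install -y containerd.io

2. Generate the default configuration

containerd config default > /etc/containerd/config.toml

3. Change the CgroupDriver to systemd

The k8s documentation recommends the systemd CgroupDriver.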

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  ...
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

4. Restart containerd

systemctl restart containerd
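Step 3 can also be scripted instead of edited by hand. The sketch below assumes the file was produced by `containerd config default`, which writes `SystemdCgroup = false` under the runc options; the function name is hypothetical.

```shell
# Flip SystemdCgroup from false to true in a containerd config file.
# $1: path to config.toml. A backup is kept next to it as <file>.bak.
enable_systemd_cgroup() {
    sed -i.bak 's/SystemdCgroup = false/SystemdCgroup = true/' "$1"
}

# Usage (as root), followed by the containerd restart shown above:
# enable_systemd_cgroup /etc/containerd/config.toml
```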

2.2. Docker

# for ubuntu
apt install -y docker.io

The official recommendation is to set the cgroup driver to systemd.

# Set Docker's cgroup driver
vi /etc/docker/daemon.json
{
"exec-opts": ["native.cgroupdriver=systemd"]
}
systemctl daemon-reload && systemctl restart docker
docker info | grep -i cgroup

2.3. Container Socket

Runtime                              Unix domain socket
Containerd                           unix:///var/run/containerd/containerd.sock
CRI-O                                unix:///var/run/crio/crio.sock
Docker Engine (using cri-dockerd)    unix:///var/run/cri-dockerd.sock

3. Install kubeadm, kubelet, and kubectl

Install kubeadm, kubelet, and kubectl on every host, preferably at the same version as the k8s release being installed.

# Ubuntu is used as the example

# Install repository prerequisites
sudo apt-get update
sudo apt-get install -y apt-transport-https ca-certificates curl

# use google registry
# (the legacy apt.kubernetes.io repository has since been retired in favor of pkgs.k8s.io)
sudo curl -fsSLo /usr/share/keyrings/kubernetes-archive-keyring.gpg https://packages.cloud.google.com/apt/doc/apt-key.gpg
echo "deb [signed-by=/usr/share/keyrings/kubernetes-archive-keyring.gpg] https://apt.kubernetes.io/ kubernetes-xenial main" | sudo tee /etc/apt/sources.list.d/kubernetes.list

# or use aliyun registry
curl -s https://mirrors.aliyun.com/kubernetes/apt/doc/apt-key.gpg | sudo apt-key add -
tee /etc/apt/sources.list.d/kubernetes.list <<EOF 
deb https://mirrors.aliyun.com/kubernetes/apt/ kubernetes-xenial main
EOF

# Install specific versions of kubeadm, kubelet, and kubectl
apt-get update
apt-get install -y kubelet=1.24.2-00 kubeadm=1.24.2-00 kubectl=1.24.2-00

# List the available versions
apt-cache madison kubeadm

Offline download and install

#!/bin/bash
Version=${Version:-1.24.2}
wget https://dl.k8s.io/release/v${Version}/bin/linux/amd64/kubeadm
wget https://dl.k8s.io/release/v${Version}/bin/linux/amd64/kubelet
wget https://dl.k8s.io/release/v${Version}/bin/linux/amd64/kubectl
chmod +x kubeadm kubelet kubectl
cp kubeadm kubelet kubectl /usr/bin/

# add kubelet service
cat > /lib/systemd/system/kubelet.service << EOF
[Unit]
Description=kubelet: The Kubernetes Node Agent
Documentation=https://kubernetes.io/docs/home/
Wants=network-online.target
After=network-online.target

[Service]
ExecStart=/usr/bin/kubelet
Restart=always
StartLimitInterval=0
RestartSec=10

[Install]
WantedBy=multi-user.target
EOF

mkdir -p /etc/systemd/system/kubelet.service.d
cat > /etc/systemd/system/kubelet.service.d/10-kubeadm.conf << EOF
# Note: This dropin only works with kubeadm and kubelet v1.11+
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
# This is a file that "kubeadm init" and "kubeadm join" generates at runtime, populating the KUBELET_KUBEADM_ARGS variable dynamically
EnvironmentFile=-/var/lib/kubelet/kubeadm-flags.env
# This is a file that the user can use for overrides of the kubelet args as a last resort. Preferably, the user should use
# the .NodeRegistration.KubeletExtraArgs object in the configuration files instead. KUBELET_EXTRA_ARGS should be sourced from this file.
EnvironmentFile=-/etc/default/kubelet
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_CONFIG_ARGS $KUBELET_KUBEADM_ARGS $KUBELET_EXTRA_ARGS
EOF


systemctl daemon-reload
systemctl restart kubelet

4. Configure kubeadm config

4.1. Configuration options

4.1.1. Configuration types

kubeadm config supports the following configuration kinds.

apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration

apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration

apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration

apiVersion: kubeadm.k8s.io/v1beta3
kind: JoinConfiguration

The default init and join configurations can be printed with the following commands.

kubeadm config print init-defaults
kubeadm config print join-defaults

4.1.2. Init configuration

Of the kinds used by kubeadm init, only InitConfiguration and ClusterConfiguration are required.

InitConfiguration:

apiVersion: kubeadm.k8s.io/v1beta3
kind: InitConfiguration
bootstrapTokens:
  ...
nodeRegistration:
  ...
  • bootstrapTokens
  • nodeRegistration
    • criSocket: socket of the container runtime
    • name: node name
  • localAPIEndpoint
    • advertiseAddress: IP address advertised by the apiserver
    • bindPort: secure port of the k8s control plane

ClusterConfiguration:

apiVersion: kubeadm.k8s.io/v1beta3
kind: ClusterConfiguration
networking:
  ...
etcd:
  ...
apiServer:
  extraArgs:
    ...
  extraVolumes:
    ...
...
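  • networking:

    • podSubnet: Pod CIDR range
    • serviceSubnet: Service CIDR range
    • dnsDomain
  • etcd:

    • dataDir: Etcd data directory
  • apiServer:

    • certSANs: extra domain names or IPs to include as SANs in the apiserver certificate
  • imageRepository: image registry

  • controlPlaneEndpoint: domain name of the control-plane LB

  • kubernetesVersion: k8s version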

4.2. Init configuration example

Generate the default configuration on a master node, then edit the parameters.

kubeadm config print init-defaults > kubeadm-config.yaml

Edit the configuration:

apiVersion: kubeadm.k8s.io/v1beta3
bootstrapTokens:
- groups:
  - system:bootstrappers:kubeadm:default-node-token
  token: abcdef.0123456789abcdef
  ttl: 24h0m0s
  usages:
  - signing
  - authentication
kind: InitConfiguration
localAPIEndpoint:
  advertiseAddress: 1.2.3.4 # change to the apiserver IP, or remove localAPIEndpoint to use the default IP
  bindPort: 6443
nodeRegistration:
  criSocket: unix:///var/run/containerd/containerd.sock
  imagePullPolicy: IfNotPresent
  name: node
  taints: null
---
apiServer:
  certSANs:
  - lb.k8s.domain  # extra domain name for the apiserver
  - <vip/lb_ip>
  timeoutForControlPlane: 4m0s
apiVersion: kubeadm.k8s.io/v1beta3
certificatesDir: /etc/kubernetes/pki
clusterName: kubernetes
controllerManager: {}
dns: {}   # defaults to coredns
etcd:
  local:
    dataDir: /data/etcd   # change to the etcd data disk directory
imageRepository: k8s.gcr.io  # change to your image registry
controlPlaneEndpoint: lb.k8s.domain  # change to the control-plane domain name
kind: ClusterConfiguration
kubernetesVersion: 1.24.0  # k8s version
networking:
  dnsDomain: cluster.local
  serviceSubnet: 10.96.0.0/12
  podSubnet: 10.244.0.0/16  # Pod IP range
scheduler: {}
---
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
cgroupDriver: systemd   # set to systemd

After installation, the kubeadm config can be inspected with:

kubectl get cm -n kube-system kubeadm-config -oyaml

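5. Install the Master control plane

Pull the images in advance:

kubeadm config images pull

5.1. Install the first master

sudo kubeadm init --config kubeadm-config.yaml --upload-certs  --node-name <nodename>

Deployment parameters:

  • --control-plane-endpoint: IP address or DNS name of the control plane (kube-apiserver).

  • --apiserver-advertise-address: IP address of the kube-apiserver.

  • --pod-network-cidr: Pod network range; the control plane automatically allocates a CIDR to each node.

  • --service-cidr: IP range for Services, default "10.96.0.0/12".

  • --kubernetes-version: k8s version to install.

  • --image-repository: k8s image registry address.

  • --upload-certs: uploads the certificates shared between all control-plane instances to the cluster.

  • --node-name: overrides the hostname and is used as the node name.

When the command finishes, it prints the commands for adding masters and workers: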

...
You can now join any number of control-plane node by running the following command on each as a root:
    kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866 --control-plane --certificate-key f8902e114ef118304e561c3ecd4d0b543adc226b7a07f675f56564185ffe0c07

Please note that the certificate-key gives access to cluster sensitive data, keep it secret!
As a safeguard, uploaded-certs will be deleted in two hours; If necessary, you can use kubeadm init phase upload-certs to reload certs afterward.

Then you can join any number of worker nodes by running the following on each as root:
    kubeadm join 192.168.0.200:6443 --token 9vr73a.a8uxyaju799qwdjv --discovery-token-ca-cert-hash sha256:7c2e69131a36ae2a042a339b33381c6d0d43887e2de83720eff5359e26aec866

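5.2. Add more masters

Adding a master differs from adding a worker only in the extra --control-plane flag (together with --certificate-key), which marks the joining node as a master.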

kubeadm join <control-plane-endpoint>:6443 --token <token> \
--discovery-token-ca-cert-hash sha256:<hash> \
--control-plane --certificate-key <certificate-key> \
--node-name <nodename>

6. Add worker nodes

kubeadm join <control-plane-endpoint>:6443 --token <token> \
--discovery-token-ca-cert-hash sha256:<hash> \
--cri-socket unix:///var/run/containerd/containerd.sock \
--node-name <nodename>

7. Install the network plugin

## The installation succeeded if all nodes change to Ready afterwards
wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
kubectl apply -f ./kube-flannel.yml
kubectl get nodes

If the Pod CIDR is not 10.244.0.0/16, change the network in the flannel configuration to match the Pod CIDR.
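The CIDR change can be applied to the downloaded manifest before running kubectl apply. The sketch below assumes the manifest's net-conf.json contains a line of the form `"Network": "10.244.0.0/16"`; the function name and the example CIDR are hypothetical.

```shell
# Rewrite the "Network" value in a flannel manifest.
# $1: path to kube-flannel.yml, $2: the Pod CIDR used by the cluster.
set_flannel_network() {
    sed -i.bak "s#\"Network\": \"[^\"]*\"#\"Network\": \"$2\"#" "$1"
}

# Usage, before applying the manifest:
# set_flannel_network ./kube-flannel.yml 10.100.0.0/16
```

The sed delimiter is `#` rather than `/` because the CIDR itself contains a slash.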

7.1. Known issue

  Warning  FailedCreatePodSandBox  4m6s                kubelet            Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "300d9b570cc1e23b6335c407b8e7d0ef2c74dc2fe5d7a110678c2dc919c62edf": plugin type="flannel" failed (add): failed to delegate add: failed to set bridge addr: "cni0" already has an IP address different from 10.244.3.1/24

Cause:

The host already has a cni0 interface whose IP range differs from the flannel CIDR, so the interface must be deleted and recreated.

Fix:

ifconfig cni0 down
ip link delete cni0

8. Deploy the dashboard

kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/v2.5.0/aio/deploy/recommended.yaml

Image: kubernetesui/dashboard:v2.5.0

Default port: 8443

9. Reset the deployment

# Reset kubeadm
kubeadm reset

# Remove data directories
rm -fr /data/etcd
rm -fr /etc/kubernetes
rm -fr ~/.kube/

Remove flannel

ifconfig cni0 down
ip link delete cni0
ifconfig flannel.1 down
ip link delete flannel.1
rm -rf /var/lib/cni/
rm -f /etc/cni/net.d/*

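10. Troubleshooting

10.1. kubeadm token expired

Problem:

Adding a node fails with the following error:

[discovery] The cluster-info ConfigMap does not yet contain a JWS signature for token ID "abcdef", will try again

Cause: the token has expired. A token created at init time is deleted by the master after 24 hours.

Fix:

# Create a new token and print the full join command
kubeadm token create --print-join-command
kubeadm token list

# "kubeadm token create" on its own prints only the token, for example:
oumnnc.aqlxuvdbntlvzoiv

# Recompute the CA certificate hash
openssl x509 -pubkey -in /etc/kubernetes/pki/ca.crt | openssl rsa -pubin -outform der 2>/dev/null | openssl dgst -sha256 -hex | sed 's/^.* //'

Join the node again using the newly generated token.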
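When the token and hash are obtained separately, the join command has to be assembled by hand. The helper below is a sketch that checks the token shape (six characters, a dot, sixteen characters, all lowercase letters or digits) before printing the command; the function names are hypothetical, and "<hash>" stands for the value printed by the openssl pipeline.

```shell
# Succeeds only for tokens shaped like abcdef.0123456789abcdef.
valid_token() {
    printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

# $1: control-plane endpoint (host:port), $2: token, $3: CA cert hash (hex)
build_join_command() {
    if ! valid_token "$2"; then
        echo "invalid token format: $2" >&2
        return 1
    fi
    printf 'kubeadm join %s --token %s --discovery-token-ca-cert-hash sha256:%s\n' "$1" "$2" "$3"
}

build_join_command "lb.k8s.domain:6443" "oumnnc.aqlxuvdbntlvzoiv" "<hash>"
```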

10.2. Change the master IP or port used by kubeadm join

The kubeadm join command reads the ConfigMap named cluster-info from the kube-public namespace. To change the master IP or port that kubeadm join uses, edit the cluster-info ConfigMap.

# View cluster-info
kubectl -n kube-public get configmaps cluster-info -o yaml

# Edit cluster-info
kubectl -n kube-public edit configmaps cluster-info

Change the server field in the configuration:

clusters:
- cluster:
    certificate-authority-data: xxx
    server: https://lb.k8s.domain:36443
  name: ""

Then pass the new master address when running kubeadm join.

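2.1.2 - Installing kubernetes with kubespray

1. Environment preparation

1.1. Machines

The following machines are virtual machines.

Machine IP      Hostname        Role              OS version       Notes
172.16.94.140   kube-master-0   k8s master        CentOS 4.17.14   Memory: 3G
172.16.94.141   kube-node-41    k8s node          CentOS 4.17.14   Memory: 3G
172.16.94.142   kube-node-42    k8s node          CentOS 4.17.14   Memory: 3G
172.16.94.135                   management host   -

1.2. Configure the management host

The management host is used to deploy the k8s cluster and needs the following software versions installed:

ansible>=2.4.0
jinja2>=2.9.6
netaddr
pbr>=1.6
ansible-modules-hashivault>=3.9.4
hvac

1. Install and configure ansible

2. Install python-netaddr

# Install pip
yum -y install epel-release
yum -y install python-pip
# Install python-netaddr
pip install netaddr

3. Upgrade Jinja

# Jinja 2.9 (or newer)
pip install --upgrade jinja2

1.3. Configure the cluster machines

The cluster machines are the machines that run the k8s cluster, including Master and Node.

1. Confirm the system version

This article uses CentOS 7; upgrading the kernel to 4.x or later is recommended.

2. Disable the firewall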

systemctl stop firewalld
systemctl disable firewalld
iptables -F

3. Disable swap

Kubespray v2.5.0 requires swap to be disabled, as enforced by this assertion:

- name: Stop if swap enabled
  assert:
    that: ansible_swaptotal_mb == 0
  when: kubelet_fail_swap_on|default(true)
  ignore_errors: "{{ ignore_assert_errors }}"

Version v2.6.0 removed the swap check.

Run swapoff -a to disable swap:

[root@master ~]#swapoff -a
[root@master ~]#
[root@master ~]# free -m
              total        used        free      shared  buff/cache   available
Mem:            976         366         135           6         474         393
Swap:             0           0           0

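# The Swap row shows 0, which means swap is disabled

4. Confirm the memory of the cluster machines

Because this article deploys on virtual machines, memory may be insufficient, so the VM memory was raised to 3G or more; physical machines usually do not have this problem. The relevant checks are: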

- name: Stop if memory is too small for masters
  assert:
    that: ansible_memtotal_mb >= 1500
  ignore_errors: "{{ ignore_assert_errors }}"
  when: inventory_hostname in groups['kube-master']

- name: Stop if memory is too small for nodes
  assert:
    that: ansible_memtotal_mb >= 1024
  ignore_errors: "{{ ignore_assert_errors }}"
  when: inventory_hostname in groups['kube-node']

1.4. Images involved

The Docker version is 17.03.2-ce.

1. Master node

Image Version Size Image ID Notes
gcr.io/google-containers/hyperkube v1.9.5 620 MB a7e7fdbc5fee k8s
quay.io/coreos/etcd v3.2.4 35.7 MB 498ffffcfd05
gcr.io/google_containers/pause-amd64 3.0 747 kB 99e59f495ffa
quay.io/calico/node v2.6.8 282 MB e96a297310fd calico
quay.io/calico/cni v1.11.4 70.8 MB 4c4cb67d7a88 calico
quay.io/calico/ctl v1.6.3 44.4 MB 46d3aace8bc6 calico

2. Node

Image Version Size Image ID Notes
gcr.io/google-containers/hyperkube v1.9.5 620 MB a7e7fdbc5fee k8s
gcr.io/google_containers/pause-amd64 3.0 747 kB 99e59f495ffa
quay.io/calico/node v2.6.8 282 MB e96a297310fd calico
quay.io/calico/cni v1.11.4 70.8 MB 4c4cb67d7a88 calico
quay.io/calico/ctl v1.6.3 44.4 MB 46d3aace8bc6 calico
gcr.io/google_containers/k8s-dns-dnsmasq-nanny-amd64 1.14.8 40.9 MB c2ce1ffb51ed dns
gcr.io/google_containers/k8s-dns-sidecar-amd64 1.14.8 42.2 MB 6f7f2dc7fab5 dns
gcr.io/google_containers/k8s-dns-kube-dns-amd64 1.14.8 50.5 MB 80cc5ea4b547 dns
gcr.io/google_containers/cluster-proportional-autoscaler-amd64 1.1.2 50.5 MB 78cf3f492e6b
gcr.io/google_containers/kubernetes-dashboard-amd64 v1.8.3 102 MB 0c60bcf89900 dashboard
nginx 1.13 109 MB ae513a47849c -

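3. Notes

  • The image registries are blocked on some networks and pulling every image takes a long time, so download the images to the cluster machines in advance.
  • The hyperkube image runs the core k8s components (for example kube-apiserver).
  • The network component used here is calico.

2. Deploy the cluster

2.1. Download the kubespray source code

git clone https://github.com/kubernetes-incubator/kubespray.git

2.2. Edit the configuration files

2.2.1. hosts.ini

hosts.ini describes the machines to deploy to. The sample file is at kubespray/inventory/sample/hosts.ini.

cd kubespray
# Copy the sample inventory and edit the copy
cp -rfp inventory/sample inventory/k8s
vi inventory/k8s/hosts.ini

For example:

hosts.ini can contain the login password of each machine; alternatively, omit the password and set up passwordless ssh login.

# Configure 'ip' variable to bind kubernetes services on a
# different ip than the default iface
# hostname          ssh login IP                     ssh user                ssh password          machine IP         netmask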
kube-master-0     ansible_ssh_host=172.16.94.140   ansible_ssh_user=root   ansible_ssh_pass=123  ip=172.16.94.140   mask=/24
kube-node-41      ansible_ssh_host=172.16.94.141   ansible_ssh_user=root   ansible_ssh_pass=123  ip=172.16.94.141   mask=/24
kube-node-42      ansible_ssh_host=172.16.94.142   ansible_ssh_user=root   ansible_ssh_pass=123  ip=172.16.94.142   mask=/24

# configure a bastion host if your nodes are not directly reachable
# bastion ansible_ssh_host=x.x.x.x

[kube-master]
kube-master-0

[etcd]
kube-master-0

[kube-node]
kube-node-41
kube-node-42

[k8s-cluster:children]
kube-node
kube-master

[calico-rr]

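2.2.2. k8s-cluster.yml

k8s-cluster.yml is the main configuration file of the k8s cluster, located at kubespray/inventory/k8s/group_vars/k8s-cluster.yml. The k8s version to install can be changed in this file through the parameter kube_version: v1.9.5.

2.3. Run the deployment

The playbook involved is cluster.yml.

# Enter the project directory
cd kubespray
# Run the deploy command
ansible-playbook -i inventory/k8s/hosts.ini cluster.yml -b -vvv

The -vvv flag prints verbose run logs.

To reset the cluster, run the following command.

The playbook involved is reset.yml.

ansible-playbook -i inventory/k8s/hosts.ini reset.yml -b -vvv

3. Verify the deployment

3.1. Ansible result

If the ansible run ends with a recap like the following (unreachable=0 and failed=0 for every host), the deployment succeeded; otherwise fix the reported errors and run it again.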

PLAY RECAP *****************************************************************************
kube-master-0              : ok=309  changed=30   unreachable=0    failed=0
kube-node-41               : ok=203  changed=8    unreachable=0    failed=0
kube-node-42               : ok=203  changed=8    unreachable=0    failed=0
localhost                  : ok=2    changed=0    unreachable=0    failed=0

Part of the deployment log follows:

kubernetes/preinstall : Update package management cache (YUM) --------------------23.96s
/root/gopath/src/kubespray/roles/kubernetes/preinstall/tasks/main.yml:121 
kubernetes/master : Master | wait for the apiserver to be running ----------------23.44s
/root/gopath/src/kubespray/roles/kubernetes/master/handlers/main.yml:79 
kubernetes/preinstall : Install packages requirements ----------------------------20.20s
/root/gopath/src/kubespray/roles/kubernetes/preinstall/tasks/main.yml:203 
kubernetes/secrets : Check certs | check if a cert already exists on node --------13.94s
/root/gopath/src/kubespray/roles/kubernetes/secrets/tasks/check-certs.yml:17 
gather facts from all instances --------------------------------------------------9.98s
/root/gopath/src/kubespray/cluster.yml:25 
kubernetes/node : install | Compare host kubelet with hyperkube container --------9.66s
/root/gopath/src/kubespray/roles/kubernetes/node/tasks/install_host.yml:2 
kubernetes-apps/ansible : Kubernetes Apps | Start Resources -----------------------9.27s
/root/gopath/src/kubespray/roles/kubernetes-apps/ansible/tasks/main.yml:37 
kubernetes-apps/ansible : Kubernetes Apps | Lay Down KubeDNS Template ------------8.47s
/root/gopath/src/kubespray/roles/kubernetes-apps/ansible/tasks/kubedns.yml:3
download : Sync container ---------------------------------------------------------8.23s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:15 
kubernetes-apps/network_plugin/calico : Start Calico resources --------------------7.82s
/root/gopath/src/kubespray/roles/kubernetes-apps/network_plugin/calico/tasks/main.yml:2 
download : Download items ---------------------------------------------------------7.67s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6 
download : Download items ---------------------------------------------------------7.48s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6 
download : Sync container ---------------------------------------------------------7.35s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:15 
download : Download items ---------------------------------------------------------7.16s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6 
network_plugin/calico : Calico | Copy cni plugins from calico/cni container -------7.10s
/root/gopath/src/kubespray/roles/network_plugin/calico/tasks/main.yml:62 
download : Download items ---------------------------------------------------------7.04s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6
download : Download items ---------------------------------------------------------7.01s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6 
download : Sync container ---------------------------------------------------------7.00s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:15 
download : Download items ---------------------------------------------------------6.98s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6 
download : Download items ---------------------------------------------------------6.79s
/root/gopath/src/kubespray/roles/download/tasks/main.yml:6 

3.2. k8s cluster status

1. k8s components

# kubectl get all --namespace=kube-system
NAME             DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
ds/calico-node   3         3         3         3            3           <none>          2h

NAME                          DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/kube-dns               2         2         2            2           2h
deploy/kubedns-autoscaler     1         1         1            1           2h
deploy/kubernetes-dashboard   1         1         1            1           2h

NAME                                 DESIRED   CURRENT   READY     AGE
rs/kube-dns-79d99cdcd5               2         2         2         2h
rs/kubedns-autoscaler-5564b5585f     1         1         1         2h
rs/kubernetes-dashboard-69cb58d748   1         1         1         2h

NAME             DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
ds/calico-node   3         3         3         3            3           <none>          2h

NAME                          DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deploy/kube-dns               2         2         2            2           2h
deploy/kubedns-autoscaler     1         1         1            1           2h
deploy/kubernetes-dashboard   1         1         1            1           2h

NAME                                 DESIRED   CURRENT   READY     AGE
rs/kube-dns-79d99cdcd5               2         2         2         2h
rs/kubedns-autoscaler-5564b5585f     1         1         1         2h
rs/kubernetes-dashboard-69cb58d748   1         1         1         2h

NAME                                       READY     STATUS    RESTARTS   AGE
po/calico-node-22vsg                       1/1       Running   0          2h
po/calico-node-t7zgw                       1/1       Running   0          2h
po/calico-node-zqnx8                       1/1       Running   0          2h
po/kube-apiserver-kube-master-0            1/1       Running   0          22h
po/kube-controller-manager-kube-master-0   1/1       Running   0          2h
po/kube-dns-79d99cdcd5-f2t6t               3/3       Running   0          2h
po/kube-dns-79d99cdcd5-gw944               3/3       Running   0          2h
po/kube-proxy-kube-master-0                1/1       Running   2          22h
po/kube-proxy-kube-node-41                 1/1       Running   3          22h
po/kube-proxy-kube-node-42                 1/1       Running   3          22h
po/kube-scheduler-kube-master-0            1/1       Running   0          2h
po/kubedns-autoscaler-5564b5585f-lt9bb     1/1       Running   0          2h
po/kubernetes-dashboard-69cb58d748-wmb9x   1/1       Running   0          2h
po/nginx-proxy-kube-node-41                1/1       Running   3          22h
po/nginx-proxy-kube-node-42                1/1       Running   3          22h

NAME                       TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)         AGE
svc/kube-dns               ClusterIP   10.233.0.3     <none>        53/UDP,53/TCP   2h
svc/kubernetes-dashboard   ClusterIP   10.233.27.24   <none>        443/TCP         2h

2. k8s nodes

# kubectl get nodes
NAME            STATUS    ROLES     AGE       VERSION
kube-master-0   Ready     master    22h       v1.9.5
kube-node-41    Ready     node      22h       v1.9.5
kube-node-42    Ready     node      22h       v1.9.5

3. Component health

# kubectl get cs
NAME                 STATUS    MESSAGE              ERROR
scheduler            Healthy   ok
controller-manager   Healthy   ok
etcd-0               Healthy   {"health": "true"}

4. Scale out the k8s cluster

4.1. Edit hosts.ini

To add Node machines, edit hosts.ini and add the new machine information. For example, to add the node kube-node-43 (IP 172.16.94.143), the file becomes:

# Configure 'ip' variable to bind kubernetes services on a
# different ip than the default iface
# hostname          ssh login IP                     ssh user                ssh password          machine IP         netmask
kube-master-0     ansible_ssh_host=172.16.94.140   ansible_ssh_user=root   ansible_ssh_pass=123  ip=172.16.94.140   mask=/24
kube-node-41      ansible_ssh_host=172.16.94.141   ansible_ssh_user=root   ansible_ssh_pass=123  ip=172.16.94.141   mask=/24
kube-node-42      ansible_ssh_host=172.16.94.142   ansible_ssh_user=root   ansible_ssh_pass=123  ip=172.16.94.142   mask=/24
kube-node-43      ansible_ssh_host=172.16.94.143   ansible_ssh_user=root   ansible_ssh_pass=123  ip=172.16.94.143   mask=/24

# configure a bastion host if your nodes are not directly reachable
# bastion ansible_ssh_host=x.x.x.x

[kube-master]
kube-master-0

[etcd]
kube-master-0

[kube-node]
kube-node-41
kube-node-42
kube-node-43

[k8s-cluster:children]
kube-node
kube-master

[calico-rr]
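Host lines for new nodes can be generated rather than typed. The helper below is a sketch: the function name is hypothetical, it assumes passwordless ssh login (so ansible_ssh_pass is omitted), and it hard-codes the /24 netmask used in this inventory.

```shell
# Print an inventory host line in the format used by hosts.ini above.
# $1: hostname, $2: IP address, $3: ssh user (defaults to root)
inventory_line() {
    printf '%s ansible_ssh_host=%s ansible_ssh_user=%s ip=%s mask=/24\n' "$1" "$2" "${3:-root}" "$2"
}

inventory_line kube-node-43 172.16.94.143
```

The printed line goes in the host list at the top of the file; the hostname must still be added to the [kube-node] group by hand.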

4.2. Run the scale-out command

The playbook involved is scale.yml.

# Enter the project directory
cd kubespray
# Run the scale-out command
ansible-playbook -i inventory/k8s/hosts.ini scale.yml -b -vvv

4.3. Check the result

1. Ansible result

PLAY RECAP ***************************************
kube-node-41               : ok=228  changed=11   unreachable=0    failed=0
kube-node-42               : ok=197  changed=6    unreachable=0    failed=0
kube-node-43               : ok=227  changed=69   unreachable=0    failed=0 # newly added Node
localhost                  : ok=2    changed=0    unreachable=0    failed=0

2. k8s nodes

# kubectl get nodes
NAME            STATUS    ROLES     AGE       VERSION
kube-master-0   Ready     master    1d        v1.9.5
kube-node-41    Ready     node      1d        v1.9.5
kube-node-42    Ready     node      1d        v1.9.5
kube-node-43    Ready     node      1m        v1.9.5   # newly added Node

The new kube-node-43 node has joined the cluster.

3. k8s components

# kubectl get po --namespace=kube-system -o wide
NAME                                    READY     STATUS    RESTARTS   AGE       IP               NODE
calico-node-22vsg                       1/1       Running   0          10h       172.16.94.140    kube-master-0
calico-node-8fz9x                       1/1       Running   2          27m       172.16.94.143    kube-node-43
calico-node-t7zgw                       1/1       Running   0          10h       172.16.94.142    kube-node-42
calico-node-zqnx8                       1/1       Running   0          10h       172.16.94.141    kube-node-41
kube-apiserver-kube-master-0            1/1       Running   0          1d        172.16.94.140    kube-master-0
kube-controller-manager-kube-master-0   1/1       Running   0          10h       172.16.94.140    kube-master-0
kube-dns-79d99cdcd5-f2t6t               3/3       Running   0          10h       10.233.100.194   kube-node-41
kube-dns-79d99cdcd5-gw944               3/3       Running   0          10h       10.233.107.1     kube-node-42
kube-proxy-kube-master-0                1/1       Running   2          1d        172.16.94.140    kube-master-0
kube-proxy-kube-node-41                 1/1       Running   3          1d        172.16.94.141    kube-node-41
kube-proxy-kube-node-42                 1/1       Running   3          1d        172.16.94.142    kube-node-42
kube-proxy-kube-node-43                 1/1       Running   0          26m       172.16.94.143    kube-node-43
kube-scheduler-kube-master-0            1/1       Running   0          10h       172.16.94.140    kube-master-0
kubedns-autoscaler-5564b5585f-lt9bb     1/1       Running   0          10h       10.233.100.193   kube-node-41
kubernetes-dashboard-69cb58d748-wmb9x   1/1       Running   0          10h       10.233.107.2     kube-node-42
nginx-proxy-kube-node-41                1/1       Running   3          1d        172.16.94.141    kube-node-41
nginx-proxy-kube-node-42                1/1       Running   3          1d        172.16.94.142    kube-node-42
nginx-proxy-kube-node-43                1/1       Running   0          26m       172.16.94.143    kube-node-43

5. Deploy a highly available cluster

In hosts.ini, extend the master and etcd groups to multiple machines, then run the deployment command.

ansible-playbook -i inventory/k8s/hosts.ini cluster.yml -b -vvv

For example, a hosts.ini with three master/etcd machines:

# Configure 'ip' variable to bind kubernetes services on a
# different ip than the default iface
# hostname        SSH login IP                     SSH user                SSH password          host IP            netmask
kube-master-0     ansible_ssh_host=172.16.94.140   ansible_ssh_user=root   ansible_ssh_pass=123  ip=172.16.94.140   mask=/24
kube-master-1     ansible_ssh_host=172.16.94.144   ansible_ssh_user=root   ansible_ssh_pass=123  ip=172.16.94.144   mask=/24
kube-master-2     ansible_ssh_host=172.16.94.145   ansible_ssh_user=root   ansible_ssh_pass=123  ip=172.16.94.145   mask=/24
kube-node-41      ansible_ssh_host=172.16.94.141   ansible_ssh_user=root   ansible_ssh_pass=123  ip=172.16.94.141   mask=/24
kube-node-42      ansible_ssh_host=172.16.94.142   ansible_ssh_user=root   ansible_ssh_pass=123  ip=172.16.94.142   mask=/24
kube-node-43      ansible_ssh_host=172.16.94.143   ansible_ssh_user=root   ansible_ssh_pass=123  ip=172.16.94.143   mask=/24

# configure a bastion host if your nodes are not directly reachable
# bastion ansible_ssh_host=x.x.x.x

[kube-master]
kube-master-0
kube-master-1
kube-master-2

[etcd]
kube-master-0
kube-master-1
kube-master-2

[kube-node]
kube-node-41
kube-node-42
kube-node-43

[k8s-cluster:children]
kube-node
kube-master

[calico-rr]
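
When the [etcd] group is extended to several machines, an odd member count is generally recommended so that etcd can keep quorum after losing a member. The following is a minimal sketch for counting the hosts of an inventory group before running the playbook. The helper names `count_group_hosts` and `check_etcd_members` are hypothetical and are not part of kubespray.

```shell
# Count the host lines listed under one group of an Ansible INI inventory.
# Usage: count_group_hosts <inventory-file> <group-name>
count_group_hosts() {
  awk -v group="[$2]" '
    /^\[/ { in_group = ($0 == group); next }   # a section header switches groups
    in_group && NF && $0 !~ /^#/ { n++ }       # count non-empty, non-comment lines
    END { print n + 0 }
  ' "$1"
}

# Report whether the [etcd] group has an odd number of members.
# Usage: check_etcd_members <inventory-file>
check_etcd_members() {
  members=$(count_group_hosts "$1" etcd)
  if [ $((members % 2)) -eq 1 ]; then
    echo "etcd members: $members (odd)"
  else
    echo "etcd members: $members (even or zero)"
    return 1
  fi
}

# Example (path taken from the deployment command above):
# check_etcd_members inventory/k8s/hosts.ini
```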

6. Upgrade the k8s cluster

Choose the target k8s version and run the upgrade command. The playbook involved is upgrade-cluster.yml.

ansible-playbook upgrade-cluster.yml -b -i inventory/k8s/hosts.ini -e kube_version=v1.10.4 -vvv

7. Troubleshooting

The following errors were the main ones encountered while deploying a k8s cluster with kubespray.

7.1. python-netaddr is not installed

  • Error message:
fatal: [node1]: FAILED! => {"failed": true, "msg": "The ipaddr filter requires python-netaddr be installed on the ansible controller"}
  • Solution:

Install python-netaddr; see the [Environment preparation] section above for details.

7.2. swap is not disabled

  • Error message:
fatal: [kube-master-0]: FAILED! => {
    "assertion": "ansible_swaptotal_mb == 0",
    "changed": false,
    "evaluated_to": false
}
fatal: [kube-node-41]: FAILED! => {
    "assertion": "ansible_swaptotal_mb == 0",
    "changed": false,
    "evaluated_to": false
}
fatal: [kube-node-42]: FAILED! => {
    "assertion": "ansible_swaptotal_mb == 0",
    "changed": false,
    "evaluated_to": false
}
  • Solution:

Run swapoff -a on all deployment machines to disable swap; see the [Environment preparation] section above for details.
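
Note that `swapoff -a` only disables swap until the next reboot; to keep it off, the swap entry in /etc/fstab is usually commented out as well. The following is a minimal sketch, assuming a conventional fstab layout in which the mount point or type column reads `swap`. The helper name `comment_out_swap` is hypothetical, and it prints the edited content instead of changing the file in place.

```shell
# Print an fstab-style file with every active swap entry commented out.
# Usage: comment_out_swap [fstab-file]   (defaults to /etc/fstab)
comment_out_swap() {
  sed -E '/^[^#].*[[:space:]]swap[[:space:]]/ s/^/#/' "${1:-/etc/fstab}"
}

# Example: review the result first, then apply it manually as root.
# swapoff -a
# comment_out_swap /etc/fstab
```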

7.3. Deployment machines have too little memory

  • Error message:
TASK [kubernetes/preinstall : Stop if memory is too small for masters] *********************************************************************************************************************************************************************************************************
task path: /root/gopath/src/kubespray/roles/kubernetes/preinstall/tasks/verify-settings.yml:52
Friday 10 August 2018  21:50:26 +0800 (0:00:00.940)       0:01:14.088 *********
fatal: [kube-master-0]: FAILED! => {
    "assertion": "ansible_memtotal_mb >= 1500",
    "changed": false,
    "evaluated_to": false
}

TASK [kubernetes/preinstall : Stop if memory is too small for nodes] ***********************************************************************************************************************************************************************************************************
task path: /root/gopath/src/kubespray/roles/kubernetes/preinstall/tasks/verify-settings.yml:58
Friday 10 August 2018  21:50:27 +0800 (0:00:00.570)       0:01:14.659 *********
fatal: [kube-node-41]: FAILED! => {
    "assertion": "ansible_memtotal_mb >= 1024",
    "changed": false,
    "evaluated_to": false
}
fatal: [kube-node-42]: FAILED! => {
    "assertion": "ansible_memtotal_mb >= 1024",
    "changed": false,
    "evaluated_to": false
}
	to retry, use: --limit @/root/gopath/src/kubespray/cluster.retry
  • Solution:

Increase the memory of all deployment machines; in this example it was raised to 3G or more.
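
The two assertions above require at least 1500 MB on masters and 1024 MB on nodes. The following is a minimal sketch of a comparable pre-check that can be run on each machine before the playbook. It reads MemTotal from /proc/meminfo, which is assumed here to approximate Ansible's ansible_memtotal_mb fact, and the helper name `has_min_memory` is hypothetical.

```shell
# Succeed when MemTotal, converted from kB to MB, is at least <min-mb>.
# Usage: has_min_memory <min-mb> [meminfo-file]   (defaults to /proc/meminfo)
has_min_memory() {
  total_mb=$(awk '/^MemTotal:/ { print int($2 / 1024) }' "${2:-/proc/meminfo}")
  [ "${total_mb:-0}" -ge "$1" ]
}

# Example: thresholds taken from the failed assertions above.
# has_min_memory 1500 || echo "not enough memory for a master"
# has_min_memory 1024 || echo "not enough memory for a node"
```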

7.4. kube-scheduler fails to run

The kube-scheduler component fails to run, so the request to http://localhost:10251/healthz fails.

  • Error message:
FAILED - RETRYING: Master | wait for kube-scheduler (1 retries left).
FAILED - RETRYING: Master | wait for kube-scheduler (1 retries left).
fatal: [node1]: FAILED! => {"attempts": 60, "changed": false, "content": "", "failed": true, "msg": "Status code was not [200]: Request failed: <urlopen error [Errno 111] Connection refused>", "redirected": false, "status": -1, "url": "http://localhost:10251/healthz"}
  • Solution:

This may be caused by insufficient memory; in this example, the memory of the deployment machines was increased.
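
To confirm whether the scheduler comes up after such a change, the health endpoint from the error above can be polled by hand. The following is a minimal sketch of a retry loop; the helper name `wait_for` is hypothetical, and the attempt count and delay in the example are arbitrary choices.

```shell
# Run a command until it succeeds, up to <attempts> times,
# sleeping <delay> seconds after each failed try.
# Usage: wait_for <attempts> <delay> <command> [args...]
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=0
  while [ "$i" -lt "$attempts" ]; do
    "$@" && return 0
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}

# Example: poll the kube-scheduler health endpoint on the master.
# wait_for 60 5 curl -fsS http://localhost:10251/healthz
```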

7.5. Docker package conflict

  • Error message:
failed: [k8s-node-1] (item={u'name': u'docker-engine-1.13.1-1.el7.centos'}) => {
    "attempts": 4,
    "changed": false,
    ...
    "item": {
        "name": "docker-engine-1.13.1-1.el7.centos"
    },
    "msg": "Error: docker-ce-selinux conflicts with 2:container-selinux-2.66-1.el7.noarch\n",
    "rc": 1,
    "results": [
        "Loaded plugins: fastestmirror\nLoading mirror speeds from cached hostfile\n * elrepo: mirrors.tuna.tsinghua.edu.cn\n * epel: mirrors.tongji.edu.cn\nPackage docker-engine is obsoleted by docker-ce, trying to install docker-ce-17.03.2.ce-1.el7.centos.x86_64 instead\nResolving Dependencies\n--> Running transaction check\n---> Package docker-ce.x86_64 0:17.03.2.ce-1.el7.centos will be installed\n--> Processing Dependency: docker-ce-selinux >= 17.03.2.ce-1.el7.centos for package: docker-ce-17.03.2.ce-1.el7.centos.x86_64\n--> Processing Dependency: libltdl.so.7()(64bit) for package: docker-ce-17.03.2.ce-1.el7.centos.x86_64\n--> Running transaction check\n---> Package docker-ce-selinux.noarch 0:17.03.2.ce-1.el7.centos will be installed\n---> Package libtool-ltdl.x86_64 0:2.4.2-22.el7_3 will be installed\n--> Processing Conflict: docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch conflicts docker-selinux\n--> Restarting Dependency Resolution with new changes.\n--> Running transaction check\n---> Package container-selinux.noarch 2:2.55-1.el7 will be updated\n---> Package container-selinux.noarch 2:2.66-1.el7 will be an update\n--> Processing Conflict: docker-ce-selinux-17.03.2.ce-1.el7.centos.noarch conflicts docker-selinux\n--> Finished Dependency Resolution\n You could try using --skip-broken to work around the problem\n You could try running: rpm -Va --nofiles --nodigest\n"
    ]
}
  • Solution:

Uninstall the old Docker version and let kubespray install Docker automatically.

sudo yum remove -y docker \
                  docker-client \
                  docker-client-latest \
                  docker-common \
                  docker-latest \
                  docker-latest-logrotate \
                  docker-logrotate \
                  docker-selinux \
                  docker-engine-selinux \
                  docker-engine

2.1.3 - Installing kubernetes with minikube

The following is based on Linux, specifically Ubuntu.

1. Install kubectl

curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/

To download a specific version, for example v1.9.0:

curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.9.0/bin/linux/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/

2. Install minikube

minikube source code: https://github.com/kubernetes/minikube

2.1 Install minikube

The following command installs the latest version of minikube:

curl -Lo minikube https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/

To install a specific version, download it from https://github.com/kubernetes/minikube/releases.

For example, the following installs v0.28.2:

curl -Lo minikube https://storage.googleapis.com/minikube/releases/v0.28.2/minikube-linux-amd64 && chmod +x minikube && sudo mv minikube /usr/local/bin/

2.2 minikube command help

Minikube is a CLI tool that provisions and manages single-node Kubernetes clusters optimized for development workflows.

Usage:
  minikube [command]

Available Commands:
  addons           Modify minikube's kubernetes addons
  cache            Add or delete an image from the local cache.
  completion       Outputs minikube shell completion for the given shell (bash or zsh)
  config           Modify minikube config
  dashboard        Opens/displays the kubernetes dashboard URL for your local cluster
  delete           Deletes a local kubernetes cluster
  docker-env       Sets up docker env variables; similar to '$(docker-machine env)'
  get-k8s-versions Gets the list of Kubernetes versions available for minikube when using the localkube bootstrapper
  ip               Retrieves the IP address of the running cluster
  logs             Gets the logs of the running localkube instance, used for debugging minikube, not user code
  mount            Mounts the specified directory into minikube
  profile          Profile sets the current minikube profile
  service          Gets the kubernetes URL(s) for the specified service in your local cluster
  ssh              Log into or run a command on a machine with SSH; similar to 'docker-machine ssh'
  ssh-key          Retrieve the ssh identity key path of the specified cluster
  start            Starts a local kubernetes cluster
  status           Gets the status of a local kubernetes cluster
  stop             Stops a running local kubernetes cluster
  update-check     Print current and latest version number
  update-context   Verify the IP address of the running cluster in kubeconfig.
  version          Print the version of minikube

Flags:
      --alsologtostderr                  log to standard error as well as files
  -b, --bootstrapper string              The name of the cluster bootstrapper that will set up the kubernetes cluster. (default "localkube")
      --log_backtrace_at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --log_dir string                   If non-empty, write log files in this directory
      --loglevel int                     Log level (0 = DEBUG, 5 = FATAL) (default 1)
      --logtostderr                      log to standard error instead of files
  -p, --profile string                   The name of the minikube VM being used.
	This can be modified to allow for multiple minikube instances to be run independently (default "minikube")
      --stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
  -v, --v Level                          log level for V logs
      --vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging

Use "minikube [command] --help" for more information about a command.

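3. Install a k8s cluster with minikube

3.1. minikube start

The k8s components can run directly in Docker on the host instead of in a VM. Docker must be installed first (see the Docker installation guide), and minikube is started with --vm-driver=none: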

minikube start --vm-driver=none

For example:

root@ubuntu:~# minikube start --vm-driver=none
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
Getting VM IP address...
Moving files into cluster...
Downloading kubeadm v1.10.0
Downloading kubelet v1.10.0
Finished Downloading kubelet v1.10.0
Finished Downloading kubeadm v1.10.0
Setting up certs...
Connecting to cluster...
Setting up kubeconfig...
Starting cluster components...
Kubectl is now configured to use the cluster.
===================
WARNING: IT IS RECOMMENDED NOT TO RUN THE NONE DRIVER ON PERSONAL WORKSTATIONS
	The 'none' driver will run an insecure kubernetes apiserver as root that may leave the host vulnerable to CSRF attacks

When using the none driver, the kubectl config and credentials generated will be root owned and will appear in the root home directory.
You will need to move the files to the appropriate location and then set the correct permissions.  An example of this is below:

	sudo mv /root/.kube $HOME/.kube # this will write over any previous configuration
	sudo chown -R $USER $HOME/.kube
	sudo chgrp -R $USER $HOME/.kube

	sudo mv /root/.minikube $HOME/.minikube # this will write over any previous configuration
	sudo chown -R $USER $HOME/.minikube
	sudo chgrp -R $USER $HOME/.minikube

This can also be done automatically by setting the env var CHANGE_MINIKUBE_NONE_USER=true
Loading cached images from config file.

Install a specific version of the kubernetes cluster:

# list the available versions
minikube get-k8s-versions
# start with the chosen version
minikube start --kubernetes-version v1.7.3 --vm-driver=none

3.2. minikube status

$ minikube status
minikube: Running
cluster: Running
kubectl: Correctly Configured: pointing to minikube-vm at 172.16.94.139

3.3. minikube stop

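The minikube stop command stops the cluster. It shuts down the minikube virtual machine but preserves all cluster state and data. Starting the cluster again restores it to its previous state.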

3.4. minikube delete

The minikube delete command deletes the cluster. It shuts down and deletes the minikube virtual machine; no data or state is preserved.

4. View the deployment result

4.1. Deployed components

root@ubuntu:~# kubectl get all --namespace=kube-system
NAME                                        READY     STATUS    RESTARTS   AGE
pod/etcd-minikube                           1/1       Running   0          38m
pod/kube-addon-manager-minikube             1/1       Running   0          38m
pod/kube-apiserver-minikube                 1/1       Running   1          39m
pod/kube-controller-manager-minikube        1/1       Running   0          38m
pod/kube-dns-86f4d74b45-bdfnx               3/3       Running   0          38m
pod/kube-proxy-dqdvg                        1/1       Running   0          38m
pod/kube-scheduler-minikube                 1/1       Running   0          38m
pod/kubernetes-dashboard-5498ccf677-c2gnh   1/1       Running   0          38m
pod/storage-provisioner                     1/1       Running   0          38m

NAME                           TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)         AGE
service/kube-dns               ClusterIP   10.96.0.10      <none>        53/UDP,53/TCP   38m
service/kubernetes-dashboard   NodePort    10.104.48.227   <none>        80:30000/TCP    38m

NAME                        DESIRED   CURRENT   READY     UP-TO-DATE   AVAILABLE   NODE SELECTOR   AGE
daemonset.apps/kube-proxy   1         1         1         1            1           <none>          38m

NAME                                   DESIRED   CURRENT   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/kube-dns               1         1         1            1           38m
deployment.apps/kubernetes-dashboard   1         1         1            1           38m

NAME                                              DESIRED   CURRENT   READY     AGE
replicaset.apps/kube-dns-86f4d74b45               1         1         1         38m
replicaset.apps/kubernetes-dashboard-5498ccf677   1         1         1         38m

4.2. dashboard

The k8s dashboard console is available at ip:port, for example http://172.16.94.139:30000/.

5. Troubleshooting

5.1. VirtualBox is not installed

[root@minikube ~]# minikube start
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
Downloading Minikube ISO
 160.27 MB / 160.27 MB [============================================] 100.00% 0s
E0727 15:47:08.655647    9407 start.go:174] Error starting host: Error creating host: Error executing step: Running precreate checks.
: VBoxManage not found. Make sure VirtualBox is installed and VBoxManage is in the path.

 Retrying.
E0727 15:47:08.656994    9407 start.go:180] Error starting host:  Error creating host: Error executing step: Running precreate checks.
: VBoxManage not found. Make sure VirtualBox is installed and VBoxManage is in the path
================================================================================
An error has occurred. Would you like to opt in to sending anonymized crash
information to minikube to help prevent future errors?
To opt out of these messages, run the command:
	minikube config set WantReportErrorPrompt false
================================================================================
Please enter your response [Y/n]:

Solution: install VirtualBox first.

5.2. Docker is not installed

[root@minikube ~]# minikube start --vm-driver=none
Starting local Kubernetes v1.10.0 cluster...
Starting VM...
E0727 15:56:54.936706    9441 start.go:174] Error starting host: Error creating host: Error executing step: Running precreate checks.
: docker cannot be found on the path for this machine. A docker installation is a requirement for using the none driver: exec: "docker": executable file not found in $PATH.

 Retrying.
E0727 15:56:54.938930    9441 start.go:180] Error starting host:  Error creating host: Error executing step: Running precreate checks.
: docker cannot be found on the path for this machine. A docker installation is a requirement for using the none driver: exec: "docker": executable file not found in $PATH

Solution: install Docker first.

References:

https://github.com/kubernetes/minikube

https://kubernetes.io/docs/setup/minikube/

https://kubernetes.io/docs/tasks/tools/install-minikube/

https://kubernetes.io/docs/tasks/tools/install-kubectl/

2.1.4 - Installing kubernetes with kind

1. Install kind

On mac or linux

curl -Lo ./kind "https://github.com/kubernetes-sigs/kind/releases/download/v0.7.0/kind-$(uname)-amd64"
chmod +x ./kind
mv ./kind /some-dir-in-your-PATH/kind

2. Create a k8s cluster

$ kind create cluster
Creating cluster "kind" ...
 ✓ Ensuring node image (kindest/node:v1.17.0) 🖼
 ✓ Preparing nodes 📦
 ✓ Writing configuration 📜
 ✓ Starting control-plane 🕹️
 ✓ Installing CNI 🔌
 ✓ Installing StorageClass 💾
Set kubectl context to "kind-kind"
You can now use your cluster with:

kubectl cluster-info --context kind-kind

Not sure what to do next? 😅 Check out https://kind.sigs.k8s.io/docs/user/quick-start/

View the cluster information:

$ kubectl cluster-info --context kind-kind
Kubernetes master is running at https://127.0.0.1:32768
KubeDNS is running at https://127.0.0.1:32768/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy

View the nodes:

$ kubectl get node -o wide
NAME                 STATUS   ROLES    AGE   VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION                      CONTAINER-RUNTIME
kind-control-plane   Ready    master   35h   v1.17.0   172.17.0.2    <none>        Ubuntu 19.10   3.10.107-1-tlinux2_kvm_guest-0049   containerd://1.3.2

View the pods:

$ kubectl get po --all-namespaces -o wide
NAMESPACE            NAME                                         READY   STATUS    RESTARTS   AGE   IP           NODE                 NOMINATED NODE   READINESS GATES
kube-system          coredns-6955765f44-lqk9v                     1/1     Running   0          35h   10.244.0.4   kind-control-plane   <none>           <none>
kube-system          coredns-6955765f44-zpsmc                     1/1     Running   0          35h   10.244.0.3   kind-control-plane   <none>           <none>
kube-system          etcd-kind-control-plane                      1/1     Running   0          35h   172.17.0.2   kind-control-plane   <none>           <none>
kube-system          kindnet-8mt7d                                1/1     Running   0          35h   172.17.0.2   kind-control-plane   <none>           <none>
kube-system          kube-apiserver-kind-control-plane            1/1     Running   0          35h   172.17.0.2   kind-control-plane   <none>           <none>
kube-system          kube-controller-manager-kind-control-plane   1/1     Running   0          35h   172.17.0.2   kind-control-plane   <none>           <none>
kube-system          kube-proxy-5w25s                             1/1     Running   0          35h   172.17.0.2   kind-control-plane   <none>           <none>
kube-system          kube-scheduler-kind-control-plane            1/1     Running   0          35h   172.17.0.2   kind-control-plane   <none>           <none>
local-path-storage   local-path-provisioner-7745554f7f-dckzr      1/1     Running   0          35h   10.244.0.2   kind-control-plane   <none>           <none>

docker ps

$ docker ps
CONTAINER ID        IMAGE                  COMMAND                  CREATED             STATUS              PORTS                       NAMES
93b291f99dd4        kindest/node:v1.17.0   "/usr/local/bin/entr…"   2 minutes ago       Up 2 minutes        127.0.0.1:32768->6443/tcp   kind-control-plane

3. Processes inside the kindest/node container

$ docker exec -it 93b291f99dd4 bash
root@kind-control-plane:/# ps auxw
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.1  0.0  19512  7480 ?        Ss   03:18   0:00 /sbin/init
root       105  0.0  0.0  26396  7344 ?        S<s  03:18   0:00 /lib/systemd/systemd-journald
root       141  2.3  0.3 2374736 51564 ?       Ssl  03:18   0:06 /usr/local/bin/containerd
root       325  0.0  0.0 112540  5036 ?        Sl   03:18   0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 3f415d609e15ef12b9f53557891c311c156b912d5a326544a25c8b29cfa9d366 -address /run/containerd/containerd.sock
root       346  0.0  0.0   1012     4 ?        Ss   03:18   0:00 /pause
root       370  0.0  0.0 112540  5108 ?        Sl   03:18   0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 1e1f3eed09f701fb621325e7b9e96d1c3de60ebd3bd64e0aec376e9490cf0e57 -address /run/containerd/containerd.sock
root       397  0.0  0.0 112540  4684 ?        Sl   03:18   0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id c5e451089a1a5b3dfb2cc68ee27ac7d414285be55ecfdc5bd59180fbfbc7df2e -address /run/containerd/containerd.sock
root       424  0.0  0.0 112540  4924 ?        Sl   03:18   0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 81e35f29ac8c2dda344125a10e3791be7ccf788a88f1efbc3397fa319f02881f -address /run/containerd/containerd.sock
root       443  0.0  0.0   1012     4 ?        Ss   03:18   0:00 /pause
root       458  0.0  0.0   1012     4 ?        Ss   03:18   0:00 /pause
root       465  0.0  0.0   1012     4 ?        Ss   03:18   0:00 /pause
root       548  0.7  0.1 145500 27724 ?        Ssl  03:18   0:02 kube-scheduler --authentication-kubeconfig=/etc/kubernetes/scheduler.conf --authorization-kubeconfig=/etc/kubernetes/scheduler.conf --bind-address=127.0.0.1 --kubeconfig=/etc/kubernetes/scheduler.conf --leader-elect=true
root       589  1.0  0.3 159536 54384 ?        Ssl  03:18   0:02 kube-controller-manager --allocate-node-cidrs=true --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf --bind-address=127.0.0.1 --client-ca-file=/etc/kubernetes/pki/ca.cr
root       613  3.8  1.6 445780 273484 ?       Ssl  03:18   0:10 kube-apiserver --advertise-address=172.17.0.2 --allow-privileged=true --authorization-mode=Node,RBAC --client-ca-file=/etc/kubernetes/pki/ca.crt --enable-admission-plugins=NodeRestriction --enable-bootstrap-token-auth=true --etcd-cafile=/etc/kubernetes/
root       660  1.4  0.2 10613604 37448 ?      Ssl  03:18   0:04 etcd --advertise-client-urls=https://172.17.0.2:2379 --cert-file=/etc/kubernetes/pki/etcd/server.crt --client-cert-auth=true --data-dir=/var/lib/etcd --initial-advertise-peer-urls=https://172.17.0.2:2380 --initial-cluster=kind-control-plane=https://172.
root       718  1.3  0.3 2084848 52772 ?       Ssl  03:18   0:03 /usr/bin/kubelet --bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf --config=/var/lib/kubelet/config.yaml --container-runtime=remote --container-runtime-endpoint=/run/containerd/containerd.sock --fail
root       876  0.0  0.0 112540  5084 ?        Sl   03:18   0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id adfbea8fec5ac6986407291f5bfc5aecead176954e5dabbe1517b98dd77bf78b -address /run/containerd/containerd.sock
root       893  0.0  0.0 112540  4796 ?        Sl   03:18   0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 53bdce023626b60ffaa0548b5888e457dc9c3bc45c7808a385dd0f63dcc90327 -address /run/containerd/containerd.sock
root       924  0.0  0.0   1012     4 ?        Ss   03:18   0:00 /pause
root       931  0.0  0.0   1012     4 ?        Ss   03:18   0:00 /pause
root      1000  0.0  0.0 127616 11100 ?        Ssl  03:18   0:00 /bin/kindnetd
root      1017  0.0  0.1 141060 19420 ?        Ssl  03:18   0:00 /usr/local/bin/kube-proxy --config=/var/lib/kube-proxy/config.conf --hostname-override=kind-control-plane
root      1066  0.0  0.0      0     0 ?        Z    03:18   0:00 [iptables-nft-sa] <defunct>
root      1080  0.0  0.0      0     0 ?        Z    03:18   0:00 [iptables-nft-sa] <defunct>
root      1241  0.0  0.0 112540  5156 ?        Sl   03:19   0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 5cbd7bbe186cf5847786c7a03aa4c6f82e6c805d0a189f0f3e8fb1750594260d -address /run/containerd/containerd.sock
root      1262  0.0  0.0   1012     4 ?        Ss   03:19   0:00 /pause
root      1303  0.1  0.0 134372 14088 ?        Ssl  03:19   0:00 local-path-provisioner --debug start --helper-image k8s.gcr.io/debian-base:v2.0.0 --config /etc/config/config.json
root      1411  0.0  0.0 112540  4876 ?        Sl   03:19   0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id 196b440345cb5ef47a6c31222323d35bbfef85d1d79c149ec0e3a6e22022a5f0 -address /run/containerd/containerd.sock
root      1437  0.0  0.0   1012     4 ?        Ss   03:19   0:00 /pause
root      1450  0.0  0.0 112540  4380 ?        Sl   03:19   0:00 /usr/local/bin/containerd-shim-runc-v2 -namespace k8s.io -id de7bdf052083978c78708383f842567d4fb38adff22a56792437a4de82425afe -address /run/containerd/containerd.sock
root      1480  0.0  0.0   1012     4 ?        Ss   03:19   0:00 /pause
root      1530  0.1  0.1 144324 19056 ?        Ssl  03:19   0:00 /coredns -conf /etc/coredns/Corefile
root      1531  0.1  0.1 144580 19204 ?        Ssl  03:19   0:00 /coredns -conf /etc/coredns/Corefile

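2.2 - k8s certificates and keys

1. Certificate types

  • Server certificate (server cert): used by the client to verify the identity of the server.

  • Client certificate (client cert): used by the server to verify the identity of the client.

  • Peer certificate (peer cert, which is both a server cert and a client cert): used for mutual authentication between members, for example in etcd.

1.1. Certificate types in a k8s cluster

  • etcd node: needs a server cert to identify its own service and a client cert to talk to the other nodes of the etcd cluster, so it needs a peer certificate.
  • master node: needs a server cert to identify the apiserver service and a client cert to connect to the etcd cluster, so it also needs a peer certificate.
  • kubelet: needs a server cert to identify its own service and a client cert to send requests to the apiserver, so it also uses a peer certificate.
  • kubectl, kube-proxy, calico: need a client certificate.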

2. CA certificates and keys

Directory: /etc/kubernetes/ssl

Category     Certificate/Key       Description   Component
ca           ca-key.pem
             ca.pem
             ca.csr
Kubernetes   kubernetes-key.pem
             kubernetes.pem
             kubernetes.csr
Admin        admin-key.pem
             admin.pem
             admin.csr
Kubelet      kubelet.crt
             kubelet.key

Configuration files

Category     Config file           Description
ca           ca-config.json
             ca-csr.json
Kubernetes   kubernetes-csr.json
Admin        admin-csr.json
Kube-proxy   kube-proxy-csr.json

3. The cfssl tool

Install cfssl:

# download cfssl
$ curl https://pkg.cfssl.org/R1.2/cfssl_linux-amd64 -o /usr/local/bin/cfssl
$ curl https://pkg.cfssl.org/R1.2/cfssljson_linux-amd64 -o /usr/local/bin/cfssljson
$ curl https://pkg.cfssl.org/R1.2/cfssl-certinfo_linux-amd64 -o /usr/local/bin/cfssl-certinfo

# make the binaries executable
$ chmod +x /usr/local/bin/cfssl /usr/local/bin/cfssljson /usr/local/bin/cfssl-certinfo

4. Create the CA (Certificate Authority)

4.1. Configuration source files

Create the CA configuration file

ca-config.json

cat << EOF > ca-config.json
{
  "signing": {
    "default": {
      "expiry": "87600h"
    },
    "profiles": {
      "kubernetes": {
        "usages": [
            "signing",
            "key encipherment",
            "server auth",
            "client auth"
        ],
        "expiry": "876000h"
      }
    }
  }
}
EOF

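Parameter description

  • ca-config.json: multiple profiles can be defined, each with its own expiry, usages, and other parameters; a specific profile is selected later when signing a certificate.
  • signing: the certificate can be used to sign other certificates; the generated ca.pem has CA=TRUE.
  • server auth: a client can use this CA to verify the certificate presented by a server.
  • client auth: a server can use this CA to verify the certificate presented by a client.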
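
cfssl expresses expiry in hours, which is hard to read at a glance. As a quick check of the two values used in ca-config.json above, here is a small sketch that assumes 365-day years and a value ending in `h`; the helper name `expiry_to_years` is hypothetical.

```shell
# Convert a cfssl-style expiry such as "87600h" into whole 365-day years.
expiry_to_years() {
  hours=${1%h}              # strip the trailing "h"
  echo $(( hours / 8760 ))  # 8760 = 24 * 365
}

expiry_to_years 87600h    # default expiry in ca-config.json
expiry_to_years 876000h   # expiry of the "kubernetes" profile
```

The first call prints 10 and the second prints 100, so certificates signed with the kubernetes profile are valid for roughly 100 years.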

Create the CA certificate signing request

ca-csr.json

cat << EOF > ca-csr.json
{
  "CN": "kubernetes",
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "CN",
      "ST": "ShenZhen",
      "L": "ShenZhen",
      "O": "k8s",
      "OU": "System"
    }
  ]
}
EOF

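Parameter description

Parameters of ca-csr.json:

  • CN: Common Name. kube-apiserver extracts this field from the certificate as the user name (User Name) of the request; browsers use this field to check whether a website is legitimate.

Fields in names:

  • C: country
  • ST: state or province
  • L: locality (city)
  • O: organization. kube-apiserver extracts this field from the certificate as the group (Group) the requesting user belongs to.
  • OU: organizational unit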
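
Because kube-apiserver maps CN to the user name and O to the group, it helps to see where these fields sit in a certificate subject. The following is a minimal sketch that pulls one field out of a subject string in the comma-separated form printed by openssl; the helper name `subject_field` is hypothetical.

```shell
# Print the value of one field (for example CN or O) from a subject string
# such as "C=CN, ST=ShenZhen, L=ShenZhen, O=k8s, OU=System, CN=kubernetes".
# Usage: subject_field <field> <subject>
subject_field() {
  printf '%s\n' "$2" |
    tr ',' '\n' |
    sed -E 's/^[[:space:]]+//' |
    awk -F= -v key="$1" '$1 == key { print $2 }'
}

subject='C=CN, ST=ShenZhen, L=ShenZhen, O=k8s, OU=System, CN=kubernetes'
subject_field CN "$subject"   # user name seen by kube-apiserver
subject_field O "$subject"    # group seen by kube-apiserver
```

The pipeline splits the subject into one KEY=value pair per line, trims the space that follows each comma, and prints the value whose key matches exactly, so asking for `O` does not also return `OU`.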

4.2. Run the command

cfssl gencert -initca ca-csr.json | cfssljson -bare ca

The output is as follows:

# cfssl gencert -initca ca-csr.json | cfssljson -bare ca
2019/12/13 14:35:52 [INFO] generating a new CA key and certificate from CSR
2019/12/13 14:35:52 [INFO] generate received request
2019/12/13 14:35:52 [INFO] received CSR
2019/12/13 14:35:52 [INFO] generating key: rsa-2048
2019/12/13 14:35:52 [INFO] encoded CSR
2019/12/13 14:35:52 [INFO] signed certificate with serial number 248379771349454958117219047414671162179070747780

The following files are generated:

# generated files
-rw-r--r-- 1 root root 1005 12月 13 11:32 ca.csr
-rw------- 1 root root 1675 12月 13 11:32 ca-key.pem
-rw-r--r-- 1 root root 1363 12月 13 11:32 ca.pem
# configuration source files
-rw-r--r-- 1 root root  293 12月 13 11:31 ca-config.json
-rw-r--r-- 1 root root  210 12月 13 11:31 ca-csr.json

5. Create the kubernetes certificate

5.1. Configuration source file

Create the kubernetes certificate signing request file kubernetes-csr.json.

cat << EOF > kubernetes-csr.json
{
  "CN": "kubernetes",
  "hosts": [
    "127.0.0.1",
    "<MASTER_IP>",
    "<MASTER_CLUSTER_IP>", 
    "kubernetes",
    "kubernetes.default",
    "kubernetes.default.svc",
    "kubernetes.default.svc.cluster",
    "kubernetes.default.svc.cluster.local"
  ],
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [{
    "C": "<country>",
    "ST": "<state>",
    "L": "<city>",
    "O": "<organization>",
    "OU": "<organization unit>"
  }]
}
EOF

Parameter description:

  • MASTER_IP: the IP address or domain name of the master node.
  • MASTER_CLUSTER_IP: the first IP of the service-cluster-ip-range network configured for kube-apiserver, for example 10.254.0.1.
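
The first address of the service range can be derived from the CIDR itself. The following is a minimal sketch using plain awk integer arithmetic; the helper name `first_service_ip` is hypothetical, and 10.254.0.0/16 is only an example range chosen to match the 10.254.0.1 address mentioned above.

```shell
# Print the first IP (network address + 1) of an IPv4 CIDR block.
# Usage: first_service_ip <a.b.c.d/prefix>
first_service_ip() {
  echo "$1" | awk -F'[./]' '{
    addr  = (($1 * 256 + $2) * 256 + $3) * 256 + $4   # address as one integer
    size  = 2 ^ (32 - $5)                             # number of addresses in the block
    first = addr - (addr % size) + 1                  # network address + 1
    printf "%d.%d.%d.%d\n", int(first / 16777216) % 256, int(first / 65536) % 256, int(first / 256) % 256, first % 256
  }'
}

first_service_ip 10.254.0.0/16
```

For the example range this prints 10.254.0.1, which is the value to put in place of <MASTER_CLUSTER_IP>.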

5.2. Run the command

$ cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes kubernetes-csr.json | cfssljson -bare kubernetes

The output is as follows:

# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes kubernetes-csr.json | cfssljson -bare kubernetes
2019/12/13 14:40:28 [INFO] generate received request
2019/12/13 14:40:28 [INFO] received CSR
2019/12/13 14:40:28 [INFO] generating key: rsa-2048
2019/12/13 14:40:28 [INFO] encoded CSR
2019/12/13 14:40:28 [INFO] signed certificate with serial number 392795299385191732458211386861696542628305189374
2019/12/13 14:40:28 [WARNING] This certificate lacks a "hosts" field. This makes it unsuitable for
websites. For more information see the Baseline Requirements for the Issuance and Management
of Publicly-Trusted Certificates, v.1.1.6, from the CA/Browser Forum (https://cabforum.org);
specifically, section 10.2.3 ("Information Requirements").

The following files are generated:

# generated files
-rw-r--r-- 1 root root 1269 12月 13 14:40 kubernetes.csr
-rw------- 1 root root 1679 12月 13 14:40 kubernetes-key.pem
-rw-r--r-- 1 root root 1643 12月 13 14:40 kubernetes.pem
# configuration source files
-rw-r--r-- 1 root root  580 12月 13 14:40 kubernetes-csr.json

6. Create the admin certificate

6.1. Configuration source file

Create the admin certificate signing request file admin-csr.json.

cat << EOF > admin-csr.json
{
  "CN": "admin",
  "hosts": [],
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "CN",
      "ST": "ShenZhen",
      "L": "ShenZhen",
      "O": "system:masters",
      "OU": "System"
    }
  ]
}
EOF

6.2. Run the command

$ cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes admin-csr.json | cfssljson -bare admin

The output is as follows:

# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes admin-csr.json | cfssljson -bare admin
2019/12/13 14:52:37 [INFO] generate received request
2019/12/13 14:52:37 [INFO] received CSR
2019/12/13 14:52:37 [INFO] generating key: rsa-2048
2019/12/13 14:52:37 [INFO] encoded CSR
2019/12/13 14:52:37 [INFO] signed certificate with serial number 465422983473444224050765004141217688748259757371
2019/12/13 14:52:37 [WARNING] This certificate lacks a "hosts" field. This makes it unsuitable for
websites. For more information see the Baseline Requirements for the Issuance and Management
of Publicly-Trusted Certificates, v.1.1.6, from the CA/Browser Forum (https://cabforum.org);
specifically, section 10.2.3 ("Information Requirements").

The following files are generated:

# generated files
-rw-r--r-- 1 root root 1013 12月 13 14:52 admin.csr
-rw------- 1 root root 1675 12月 13 14:52 admin-key.pem
-rw-r--r-- 1 root root 1407 12月 13 14:52 admin.pem
# configuration source files
-rw-r--r-- 1 root root  231 12月 13 14:49 admin-csr.json

7. Create the kube-proxy certificate

7.1. Configuration source file

Create the kube-proxy certificate signing request file kube-proxy-csr.json.

cat << EOF > kube-proxy-csr.json
{
  "CN": "system:kube-proxy",
  "hosts": [],
  "key": {
    "algo": "rsa",
    "size": 2048
  },
  "names": [
    {
      "C": "CN",
      "ST": "BeiJing",
      "L": "BeiJing",
      "O": "k8s",
      "OU": "System"
    }
  ]
}
EOF

7.2. 执行命令

$ cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes  kube-proxy-csr.json | cfssljson -bare kube-proxy

输出如下:

# cfssl gencert -ca=ca.pem -ca-key=ca-key.pem -config=ca-config.json -profile=kubernetes  kube-proxy-csr.json | cfssljson -bare kube-proxy
2019/12/13 19:37:48 [INFO] generate received request
2019/12/13 19:37:48 [INFO] received CSR
2019/12/13 19:37:48 [INFO] generating key: rsa-2048
2019/12/13 19:37:48 [INFO] encoded CSR
2019/12/13 19:37:48 [INFO] signed certificate with serial number 526712749765692443642491255093816136154324531741
2019/12/13 19:37:48 [WARNING] This certificate lacks a "hosts" field. This makes it unsuitable for
websites. For more information see the Baseline Requirements for the Issuance and Management
of Publicly-Trusted Certificates, v.1.1.6, from the CA/Browser Forum (https://cabforum.org);
specifically, section 10.2.3 ("Information Requirements").

生成文件:

# 生成文件
-rw-r--r-- 1 root root 1009 12月 13 19:37 kube-proxy.csr
-rw------- 1 root root 1675 12月 13 19:37 kube-proxy-key.pem
-rw-r--r-- 1 root root 1407 12月 13 19:37 kube-proxy.pem
# 配置源文件
-rw-r--r-- 1 root root  230 12月 13 19:37 kube-proxy-csr.json

8. 校验证书

openssl x509  -noout -text -in  kubernetes.pem

输出如下:

# openssl x509  -noout -text -in  kubernetes.pem
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number:
            44:cd:8c:e6:a4:60:ff:3f:09:af:02:e7:68:5e:f2:0f:e6:a0:39:fe
    Signature Algorithm: sha256WithRSAEncryption
        Issuer: C=CN, ST=ShenZhen, L=ShenZhen, O=k8s, OU=System, CN=kubernetes
        Validity
            Not Before: Dec 13 06:35:00 2019 GMT
            Not After : Nov 19 06:35:00 2119 GMT
        Subject: C=CN, ST=ShenZhen, L=ShenZhen, O=k8s, OU=System, CN=kubernetes
        Subject Public Key Info:
            Public Key Algorithm: rsaEncryption
                Public-Key: (2048 bit)
                Modulus:
                    00:d7:91:4f:90:56:fb:ab:a9:de:c4:98:9e:d7:e6:
                    45:db:5a:14:9a:76:78:6a:4c:db:3c:47:3c:e7:1c:
                    3c:37:4e:8a:cf:9c:a1:8a:4c:51:4c:cd:45:b0:03:
                    87:06:b9:20:2c:3a:51:f9:21:55:1c:90:7c:f8:93:
                    bc:6a:48:05:3d:8b:74:fd:f2:f1:e6:5e:ad:b4:a8:
                    f6:6d:f9:63:9e:e4:b4:cc:68:9e:90:d7:ef:de:ce:
                    c1:1d:1b:68:59:68:5e:5f:7d:5c:f3:49:4f:18:72:
                    be:b5:c8:af:e2:8d:34:9c:d2:68:b7:8c:26:69:cc:
                    a5:f4:ca:69:2d:d7:21:f5:19:2e:b2:b5:97:16:87:
                    9f:9c:fd:01:97:c2:0e:20:b4:88:27:9a:37:9a:af:
                    0a:cf:82:4f:26:24:cb:07:ac:8c:b1:34:20:42:22:
                    00:b2:b0:98:c5:53:01:fb:32:aa:15:1b:7e:39:44:
                    ae:af:6e:c3:65:96:f6:38:7a:87:37:d0:31:63:d8:
                    a4:15:13:f2:56:da:e6:09:45:2b:46:2c:cb:63:db:
                    f7:ba:7f:44:0a:36:39:7c:cc:5b:42:e5:56:c7:7f:
                    dd:64:5c:f2:4a:af:d3:a9:d1:6e:06:27:57:09:4d:
                    db:08:62:87:66:c8:2c:36:00:41:f1:90:f6:5f:68:
                    20:3d
                Exponent: 65537 (0x10001)
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature, Key Encipherment
            X509v3 Extended Key Usage:
                TLS Web Server Authentication, TLS Web Client Authentication
            X509v3 Basic Constraints: critical
                CA:FALSE
            X509v3 Subject Key Identifier:
                3D:3F:FA:B8:36:D7:FE:B1:59:BE:B1:F5:C1:5D:88:3D:BC:78:9F:87
            X509v3 Authority Key Identifier:
                keyid:40:A2:D4:30:22:12:2E:C2:FB:A2:55:2C:CB:F0:F6:3E:4D:B8:02:03

            X509v3 Subject Alternative Name:
                DNS:kubernetes, DNS:kubernetes.default, DNS:kubernetes.default.svc, DNS:kubernetes.default.svc.cluster, DNS:kubernetes.default.svc.cluster.local, IP Address:127.0.0.1, IP Address:172.20.0.112, IP Address:172.20.0.113, IP Address:172.20.0.114, IP Address:172.20.0.115, IP Address:10.254.0.1
    Signature Algorithm: sha256WithRSAEncryption
         63:50:f6:2a:03:c7:35:dd:e9:10:8d:2f:b3:27:9a:64:f3:e1:
         11:8a:18:1e:fa:6d:85:30:11:b4:59:a3:6c:86:cd:2b:5c:67:
         17:4f:aa:0d:bb:4c:ee:c8:af:e7:3d:61:6d:03:9d:14:6f:00:
         48:56:59:b5:76:13:a9:30:23:e0:b2:d2:12:64:0c:60:0d:76:
         ec:c6:4f:b1:bc:24:01:7a:48:c6:fd:9e:5d:68:da:b9:a1:ad:
         30:7a:ba:90:e2:e3:4e:b4:92:1b:c5:f2:8c:c1:b0:3d:fc:14:
         d2:46:e8:f8:22:8f:d9:4d:85:4f:58:6b:0f:84:78:06:b4:b9:
         92:b9:0d:bd:1d:95:e9:0d:42:d3:fd:dd:2a:59:60:3f:63:35:
         eb:07:25:d2:ea:0d:19:a6:f3:dc:92:8e:ee:73:04:15:5e:97:
         e8:da:51:c3:69:49:96:36:c7:cc:5b:e5:e5:cb:e5:ce:9f:21:
         6f:6b:56:16:bf:85:ad:1c:8c:91:c1:91:0a:90:18:e2:4a:b0:
         32:58:33:ef:55:8e:8f:4a:e3:0f:b8:f7:41:04:65:89:e1:1b:
         d8:41:28:6e:84:c3:1c:8e:a9:a0:8a:42:e4:fe:d7:fe:0e:24:
         dc:74:37:fa:5e:be:20:69:c5:9a:5a:e6:83:1c:0b:9e:e1:43:
         ef:4f:7a:37

Checks to perform:

  • Confirm that the Issuer field matches ca-csr.json;
  • Confirm that the Subject field matches kubernetes-csr.json;
  • Confirm that the X509v3 Subject Alternative Name field matches the hosts list in kubernetes-csr.json;
  • Confirm that the X509v3 Key Usage and Extended Key Usage fields match the kubernetes profile in ca-config.json.
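The Subject Alternative Name check can also be scripted. The following is a minimal sketch, assuming the `openssl x509 -noout -text` output has been captured as a string and that the CSR file carries a `hosts` array; the function names `parse_san` and `missing_hosts` are hypothetical helpers, not part of cfssl or openssl.

```python
import json


def parse_san(openssl_text: str) -> set:
    """Extract DNS names and IP addresses from `openssl x509 -text` output."""
    lines = openssl_text.splitlines()
    for i, line in enumerate(lines):
        # The SAN values sit on the line that follows the extension header.
        if "Subject Alternative Name" in line and i + 1 < len(lines):
            entries = [e.strip() for e in lines[i + 1].split(",")]
            # Each entry looks like "DNS:kubernetes" or "IP Address:127.0.0.1".
            return {e.split(":", 1)[1] for e in entries if ":" in e}
    return set()


def missing_hosts(csr_json: str, openssl_text: str) -> set:
    """Return hosts listed in the CSR file but absent from the certificate SAN."""
    hosts = set(json.loads(csr_json).get("hosts", []))
    return hosts - parse_san(openssl_text)
```

An empty result from `missing_hosts` means every host in the CSR file made it into the certificate; the Issuer, Subject and Key Usage checks still have to be done separately.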

9. 分发证书

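Copy the generated certificate and private key files (the .pem files) into /etc/kubernetes/ssl on every machine in the cluster. The commands below cover the local machine only; repeat them on each remaining node, or transfer the files there with a tool such as scp.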

mkdir -p /etc/kubernetes/ssl
cp *.pem /etc/kubernetes/ssl


2.3 - k8s版本说明

1. k8s版本号说明

k8s维护最新三个版本的发布分支([2022.7.2]当前最新三个版本为1.24、1.23、1.22),Kubernetes 1.19 和更新的版本获得大约 1 年的补丁支持。

Kubernetes 版本表示为 x.y.z, 其中 x 是主要版本,y 是次要版本,z 是补丁版本。遵循语义化版本规范

2. 最新发行版本

1.24

最新发行版本:1.24.2 (发布日期: 2022-06-15)

不再支持:2023-09-29

补丁版本: 1.24.1、 1.24.2

Complete 1.24 Schedule and Changelog

Kubernetes 1.24 使用 go1.18构建,默认情况下将不再验证使用 SHA-1 哈希算法签名的证书。

2.1.1. 重要更新

1.24.0主要参考kubernetes/CHANGELOG-1.24.major-themes

1)kubelet完全移除Dockershim【最重大更新】

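dockershim was deprecated in v1.20 and has now been removed from the kubelet. Starting with v1.24 you need to use one of the other supported runtimes (such as containerd or CRI-O), or use cri-dockerd if you rely on Docker Engine as your container runtime. For more on making sure your cluster is ready for this removal, see the guide "Is Your Cluster Ready for v1.24?" on the Kubernetes website.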

2)Beta API 默认关闭

默认情况下,不会在集群中启用新的 beta API。默认情况下,现有的 beta API 和现有 beta API 的新版本将继续启用。

3)存储容量和卷扩展到GA

存储容量跟踪支持通过 CSIStorageCapacity 对象公开当前可用的存储容量,并增强使用具有后期绑定的 CSI 卷的 pod 的调度。

卷扩展增加了对调整现有持久卷大小的支持。

4)避免 IP 分配给service的冲突

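Kubernetes 1.24 introduces a new opt-in feature that lets you soft-reserve a range for statically assigned Service IP addresses. When the feature is enabled manually, the cluster prefers to draw automatically assigned addresses from the rest of the Service IP pool, which lowers the risk of collision.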

可以分配 Service ClusterIP:

  • 动态,这意味着集群将自动在配置的服务 IP 范围内选择一个空闲 IP。

  • 静态,这意味着用户将在配置的服务 IP 范围内设置一个 IP。

Service ClusterIP 是唯一的,因此,尝试使用已分配的 ClusterIP 创建 Service 将返回错误。
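As an illustration of the soft reservation, the sketch below computes the size of the lower band intended for static assignment. It assumes the formula given in the upstream Kubernetes documentation, min(max(16, cidrSize / 16), 256); that formula is not stated in these notes, so treat it as an assumption, and the function names are hypothetical.

```python
import ipaddress


def static_band_size(service_cidr: str) -> int:
    """Number of addresses in the lower (static-preferred) band of the range.

    Assumed formula: min(max(16, cidrSize / 16), 256).
    """
    size = ipaddress.ip_network(service_cidr).num_addresses
    return min(max(16, size // 16), 256)


def in_static_band(ip: str, service_cidr: str) -> bool:
    """True if `ip` falls in the band where static assignment is preferred."""
    net = ipaddress.ip_network(service_cidr)
    offset = int(ipaddress.ip_address(ip)) - int(net.network_address)
    return 1 <= offset <= static_band_size(service_cidr)
```

Under this assumption, a user who sets a ClusterIP by hand would pick it from the lower band, while dynamic allocation prefers the rest of the range.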

2.1.2. 弃用更新

1)kubeadm

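  • kubeadm.k8s.io/v1beta2 is deprecated and will be removed in a future release, possibly within 3 releases (one year). Start using kubeadm.k8s.io/v1beta3 for new clusters. To migrate old configuration files on disk, use the kubeadm config migrate command.

  • kubeadm now defaults to the containerd socket (Unix: unix:///var/run/containerd/containerd.sock, Windows: npipe:////./pipe/containerd-containerd) instead of the Docker one. If the Init|JoinConfiguration.nodeRegistration.criSocket field is empty during cluster creation and multiple sockets are found on the host, kubeadm always throws an error and asks the user to choose a socket by setting the field. kubeadm uses crictl for all communication with the CRI socket, for actions such as pulling images and listing running containers, instead of using the docker CLI in the Docker case.

  • kubeadm has moved away from the word "master" in labels and taints. For new clusters, the label node-role.kubernetes.io/master is no longer added to control plane nodes; only the label node-role.kubernetes.io/control-plane is added.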

2)kube-apiserver

  • The insecure address flags --address, --insecure-bind-address, --port and --insecure-port (inert since 1.20) have been removed.

  • The --master-count flag and the --endpoint-reconciler-type=master-count reconciler are deprecated in favor of the lease reconciler.

  • Service.Spec.LoadBalancerIP is deprecated.

3)kube-controller-manager

  • kube-controller-manager 中的不安全地址标志 --address 和 --port 自 v1.20 起无效,并在 v1.24 中被删除。

4)kubelet

  • --pod-infra-container-image kubelet 标志已弃用,将在未来版本中删除。

  • The following dockershim-related flags were removed together with dockershim: --experimental-dockershim-root-directory, --docker-endpoint, --image-pull-progress-deadline, --network-plugin, --cni-conf-dir, --cni-bin-dir, --cni-cache-dir, --network-plugin-mtu. (#106907, @cyclinder)

1.23

最新发行版本:1.23.8 (发布日期: 2022-06-15)

不再支持:2023-02-28

补丁版本: 1.23.1、 1.23.2、 1.23.3、 1.23.4、 1.23.5、 1.23.6、 1.23.7、 1.23.8

Complete 1.23 Schedule and Changelog

Kubernetes 是使用 golang 1.17 构建的。此版本的 go 删除了使用 GODEBUG=x509ignoreCN=0 环境设置来重新启用将 X.509 服务证书的 CommonName 视为主机名的已弃用旧行为的能力。

2.2.1. 重要更新

1)FlexVolume 已弃用

FlexVolume 已弃用。 Out-of-tree CSI 驱动程序是在 Kubernetes 中编写卷驱动程序的推荐方式。FlexVolume 驱动程序的维护者应实施 CSI 驱动程序并将 FlexVolume 的用户转移到 CSI。 FlexVolume 的用户应将其工作负载转移到 CSI 驱动程序。

2)IPv4/IPv6 双栈网络到 GA

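IPv4/IPv6 dual-stack networking graduates to GA. Since 1.21, Kubernetes clusters have dual-stack support enabled by default. In 1.23 the IPv6DualStack feature gate is removed. Using dual-stack networking is not mandatory: although the cluster is enabled to support it, Pods and Services continue to default to single-stack. To use dual-stack networking, the Kubernetes nodes need routable IPv4/IPv6 network interfaces, a dual-stack capable CNI network plugin must be used, Pods must be configured to be dual-stack, and Services must have their .spec.ipFamilyPolicy field set to either PreferDualStack or RequireDualStack.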

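3)HorizontalPodAutoscaler v2 graduates to GA

Version 2 of the HorizontalPodAutoscaler API graduates to stable in the 1.23 release. The HorizontalPodAutoscaler autoscaling/v2beta2 API is deprecated in favor of the new autoscaling/v2 API, which the Kubernetes project recommends for all use cases.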

4)Scheduler简化多点插件配置

kube-scheduler 正在为插件添加一个新的、简化的配置字段,以允许在一个位置启用多个扩展点。新的 multiPoint 插件字段旨在为管理员简化大多数调度程序设置。通过 multiPoint 启用的插件将自动为它们实现的每个单独的扩展点注册。例如,实现 Score 和 Filter 扩展的插件可以同时为两者启用。这意味着可以启用和禁用整个插件,而无需手动编辑单个扩展点设置。这些扩展点现在可以被抽象出来,因为它们与大多数用户无关。

2.2.2. 已知问题

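A data corruption issue was found in etcd v3.5.0, the version shipped with Kubernetes 1.22. Please read etcd's latest production recommendations (see the CHANGELOG in the etcd-io/etcd repository on GitHub).

Running etcd v3.5.2, v3.5.1 or v3.5.0 under high load can cause data corruption: if the etcd process is killed, some committed transactions are occasionally not reflected on all members. Upgrading to v3.5.3 is recommended.

The minimum recommended etcd versions for production are 3.3.18+, 3.4.2+ and v3.5.3+.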

1.22

最新发行版本:1.22.11 (发布日期: 2022-06-15)

不再支持:2022-10-28

补丁版本: 1.22.1、 1.22.2、 1.22.3、 1.22.4、 1.22.5、 1.22.6、 1.22.7、 1.22.8、 1.22.9、 1.22.10、 1.22.11

Complete 1.22 Schedule and Changelog

2.3.1. 重要更新

1)kubeadm

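  • kubeadm can now be run by non-root users.

  • v1beta3 is now the preferred API version; the v1beta2 API remains available and is not deprecated in this release.

  • The check on the docker cgroup driver is removed. kubeadm defaults to the systemd cgroup driver, so the container runtime has to be configured manually to use systemd as well.

  • The ClusterConfiguration.DNS.Type field is removed in v1beta3, because CoreDNS is the only supported DNS type.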

2)etcd

  • etcd is updated to v3.5.0. (A data corruption issue in v3.5.0 was later reported in the 1.23 release notes.)

3)kubelet

  • 节点支持swap内存。

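  • As an alpha feature, Kubernetes v1.22 can also use the cgroup v2 API to control memory allocation and isolation. The feature is designed to improve workload and node availability when there is contention for memory resources.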

3. 版本偏差策略

3.1. 支持的版本偏差

总结:

  • kubelet 版本不能比 kube-apiserver 版本新,最多只可落后两个次要版本。

  • kube-controller-manager, kube-scheduler and cloud-controller-manager must not be newer than kube-apiserver, and may be at most one minor version older (to allow live upgrades).

  • kubectl 在 kube-apiserver 的一个次要版本(较旧或较新)中支持。

  • kube-proxy 和节点上的 kubelet 必须是相同的次要版本。

1)kube-apiserver

高可用性(HA)集群中, 最新版和最老版的 kube-apiserver 实例版本偏差最多为一个次要版本。

例如:

  • 最新的 kube-apiserver 实例处于 1.24 版本
  • 其他 kube-apiserver 实例支持 1.24 和 1.23 版本

2)kubelet

kubelet 版本不能比 kube-apiserver 版本新,并且最多只可落后两个次要版本。

例如:

  • kube-apiserver 处于 1.24 版本
  • kubelet is supported at 1.24, 1.23 and 1.22

说明:

如果 HA 集群中的 kube-apiserver 实例之间存在版本偏差,这会缩小允许的 kubelet 版本范围。

例如:

  • kube-apiserver 实例处于 1.24 和 1.23 版本
  • kubelet 支持 1.23 和 1.22 版本, (不支持 1.24 版本,因为这将比 kube-apiserver 1.23 版本的实例新)

3)kube-controller-manager、kube-scheduler 和 cloud-controller-manager

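kube-controller-manager, kube-scheduler and cloud-controller-manager must not be newer than the kube-apiserver instances they communicate with. They are expected to match the kube-apiserver minor version, but may be up to one minor version older (to allow live upgrades).

Example:

  • kube-apiserver is at 1.24
  • kube-controller-manager, kube-scheduler and cloud-controller-manager are supported at 1.24 and 1.23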

说明:

如果 HA 集群中的 kube-apiserver 实例之间存在版本偏差, 并且这些组件可以与集群中的任何 kube-apiserver 实例通信(例如,通过负载均衡器),这会缩小这些组件所允许的版本范围。

例如:

  • kube-apiserver 实例处于 1.24 和 1.23 版本
  • kube-controller-manager, kube-scheduler and cloud-controller-manager communicate with a load balancer that can route to any kube-apiserver instance
  • kube-controller-manager, kube-scheduler and cloud-controller-manager are supported at 1.23 (1.24 is not supported because it would be newer than the kube-apiserver instance at 1.23)

4)kubectl

kubectl 在 kube-apiserver 的一个次要版本(较旧或较新)中支持。

例如:

  • kube-apiserver 处于 1.24 版本
  • kubectl is supported at 1.25, 1.24 and 1.23

说明:

如果 HA 集群中的 kube-apiserver 实例之间存在版本偏差,这会缩小支持的 kubectl 版本范围。

例如:

  • kube-apiserver 实例处于 1.24 和 1.23 版本
  • kubectl 支持 1.24 和 1.23 版本(其他版本将与 kube-apiserver 组件之一相差不止一个的次要版本)

5)kube-proxy

  • kube-proxy 和节点上的 kubelet 必须是相同的次要版本。
  • kube-proxy 版本不能比 kube-apiserver 版本新。
  • kube-proxy 最多只能比 kube-apiserver 落后两个次要版本。

例如:

如果 kube-proxy 版本处于 1.22 版本:

  • kubelet 必须处于相同的次要版本 1.22
  • kube-apiserver 版本必须介于 1.22 和 1.24 之间,包括两者。
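The skew rules of section 3.1 can be condensed into a few comparisons on the minor version. This is a minimal sketch under the assumption that all components share the same major version; the function names are hypothetical and this is not an official Kubernetes tool.

```python
def minor(version: str) -> int:
    """Return the minor component of a version such as '1.24' or 'v1.24.2'."""
    return int(version.lstrip("v").split(".")[1])


def kubelet_supported(kubelet: str, apiservers: list) -> bool:
    """kubelet: not newer than any kube-apiserver, at most two minors behind the newest."""
    api = [minor(v) for v in apiservers]
    return max(api) - 2 <= minor(kubelet) <= min(api)


def control_plane_supported(component: str, apiservers: list) -> bool:
    """kube-controller-manager, kube-scheduler, cloud-controller-manager:
    not newer than any kube-apiserver, at most one minor behind the newest."""
    api = [minor(v) for v in apiservers]
    return max(api) - 1 <= minor(component) <= min(api)


def kubectl_supported(kubectl: str, apiservers: list) -> bool:
    """kubectl: within one minor version (older or newer) of every kube-apiserver."""
    return all(abs(minor(kubectl) - minor(v)) <= 1 for v in apiservers)


def kube_proxy_supported(proxy: str, kubelet: str, apiservers: list) -> bool:
    """kube-proxy: same minor as the node's kubelet, and within the kubelet range."""
    return minor(proxy) == minor(kubelet) and kubelet_supported(proxy, apiservers)
```

Passing a list with two kube-apiserver versions, for example `["1.24", "1.23"]`, reproduces the narrowed ranges described for HA clusters.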

3.2. 组件升级顺序

优先升级kube-apiserver,其他的组件按照上述的版本要求进行升级,最好保持一致的版本。

4. k8s版本发布周期

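Kubernetes ships roughly three releases a year, that is, a new minor version (vX.Y) every 3-4 months. Each release vX.Y is cut from the Git branch release-X.Y created at the vX.Y milestone.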

发布过程可被认为具有三个主要阶段:

  • 特性增强定义
  • 实现
  • 稳定

4.1. 发布周期

1)Normal development (weeks 1-11)

  • /sig {name}
  • /kind {type}
  • /lgtm
  • /approved

2)代码冻结(第 12-14 周)

  • /milestone {v1.y}
  • /sig {name}
  • /kind {bug, failing-test}
  • /lgtm
  • /approved

3)发布后(第 14 周以上)

回到“正常开发”阶段要求:

  • /sig {name}
  • /kind {type}
  • /lgtm
  • /approved


3 - 基本概念

3.1 - kubernetes架构

3.1.1 - Kubernetes总架构图

1. Kubernetes的总架构图

2. Kubernetes各个组件介绍

2.1 kube-master[控制节点]

master的工作流程图

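  1. Kubecfg sends a specific request, such as creating a Pod, to the Kubernetes Client.
  2. The Kubernetes Client sends the request to the API Server.
  3. The API Server selects the REST Storage API that will handle the request based on the request type; for example, when a Pod is created the storage type is pods.
  4. The REST Storage API processes the request accordingly.
  5. The result is stored in Etcd, the highly available key-value store.
  6. After the API Server has responded to Kubecfg's request, the Scheduler uses the Kubernetes Client to fetch information about the Pods running in the cluster and about the Minions/Nodes.
  7. Based on the information obtained from the Kubernetes Client, the Scheduler assigns unscheduled Pods to available Minions/Nodes.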

2.1.1 API Server[资源操作入口]

  1. 提供了资源对象的唯一操作入口,其他所有组件都必须通过它提供的API来操作资源数据,只有API Server与存储通信,其他模块通过API Server访问集群状态。

    第一,是为了保证集群状态访问的安全。

    第二,是为了隔离集群状态访问的方式和后端存储实现的方式:API Server是状态访问的方式,不会因为后端存储技术etcd的改变而改变。

  2. 作为kubernetes系统的入口,封装了核心对象的增删改查操作,以RESTFul接口方式提供给外部客户和内部组件调用。对相关的资源数据“全量查询”+“变化监听”,实时完成相关的业务功能。

更多API Server信息请参考:Kubernetes核心原理(一)之API Server

2.1.2 Controller Manager[内部管理控制中心]

  1. 实现集群故障检测和恢复的自动化工作,负责执行各种控制器,主要有:
    • endpoint-controller:定期关联service和pod(关联信息由endpoint对象维护),保证service到pod的映射总是最新的。
    • replication-controller:定期关联replicationController和pod,保证replicationController定义的复制数量与实际运行pod的数量总是一致的。

更多Controller Manager信息请参考:Kubernetes核心原理(二)之Controller Manager

2.1.3 Scheduler[集群分发调度器]

  1. Scheduler收集和分析当前Kubernetes集群中所有Minion/Node节点的资源(内存、CPU)负载情况,然后依此分发新建的Pod到Kubernetes集群中可用的节点。
  2. 实时监测Kubernetes集群中未分发和已分发的所有运行的Pod。
  3. Scheduler也监测Minion/Node节点信息,由于会频繁查找Minion/Node节点,Scheduler会缓存一份最新的信息在本地。
  4. 最后,Scheduler在分发Pod到指定的Minion/Node节点后,会把Pod相关的信息Binding写回API Server。

更多Scheduler信息请参考:Kubernetes核心原理(三)之Scheduler

2.2 kube-node[服务节点]

kubelet结构图

2.2.1 Kubelet[节点上的Pod管家]

  1. 负责Node节点上pod的创建、修改、监控、删除等全生命周期的管理

  2. 定时上报本Node的状态信息给API Server。

  3. kubelet是Master API Server和Minion/Node之间的桥梁,接收Master API Server分配给它的commands和work,通过kube-apiserver间接与Etcd集群交互,读取配置信息。

  4. 具体的工作如下:

    1. 设置容器的环境变量、给容器绑定Volume、给容器绑定Port、根据指定的Pod运行一个单一容器、给指定的Pod创建network 容器。

    2. Synchronize Pod status, and fetch container info, pod info, root info and machine info from cAdvisor.

    3. 在容器中运行命令、杀死容器、删除Pod的所有容器。

更多Kubelet信息请参考:Kubernetes核心原理(四)之Kubelet

2.2.2 Proxy[负载均衡、路由转发]

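  1. Proxy was designed so that external networks can reach application services provided by containers spread across the machines of a cluster; it runs on every Minion/Node. Proxy provides proxying for TCP/UDP sockets. For each Service that is created, Proxy obtains the Services and Endpoints configuration, mainly from etcd (it can also be read from a file), then starts a proxy process on the Minion/Node that listens on the corresponding service port. When an external request arrives, Proxy dispatches it to the correct backend container according to the load balancer.
  2. Proxy not only resolves conflicts between identical service ports on the same host, it also lets a Service forward its service port to serve external clients. The Proxy backend uses random and round-robin load-balancing algorithms.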

2.2.3 kubectl[集群管理命令行工具集]

  1. 通过客户端的kubectl命令集操作,API Server响应对应的命令结果,从而达到对kubernetes集群的管理。

参考文章:

https://yq.aliyun.com/articles/47308?spm=5176.100240.searchblog.19.jF7FFa

3.1.2 - 基于Docker及Kubernetes技术构建容器云(PaaS)平台

[编者的话]

目前很多的容器云平台通过Docker及Kubernetes等技术提供应用运行平台,从而实现运维自动化,快速部署应用、弹性伸缩和动态调整应用环境资源,提高研发运营效率。

从宏观到微观(从抽象到具体)的思路来理解:云计算→PaaS→ App Engine→XAE[XXX App Engine] (XAE泛指一类应用运行平台,例如GAE、SAE、BAE等)。

This article briefly introduces several important concepts related to container clouds: PaaS, App Engine, Docker and Kubernetes.

1. PaaS概述

1.1. PaaS概念

  1. PaaS(Platform as a service),平台即服务,指将软件研发的平台(或业务基础平台)作为一种服务,以SaaS的模式提交给用户。
  2. PaaS是云计算服务的其中一种模式,云计算是一种按使用量付费的模式的服务,类似一种租赁服务,服务可以是基础设施计算资源(IaaS),平台(PaaS),软件(SaaS)。租用IT资源的方式来实现业务需要,如同水力、电力资源一样,计算、存储、网络将成为企业IT运行的一种被使用的资源,无需自己建设,可按需获得。
  3. PaaS的实质是将互联网的资源服务化为可编程接口,为第三方开发者提供有商业价值的资源和服务平台。简而言之,IaaS就是卖硬件及计算资源,PaaS就是卖开发、运行环境,SaaS就是卖软件

1.2. IaaS/PaaS/SaaS说明

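| Type | Description | Analogy | Example |
| --- | --- | --- | --- |
| IaaS: Infrastructure-as-a-Service | The service provided is computing infrastructure | A plot of land; you build the house yourself | Amazon EC2 (Amazon Elastic Compute Cloud) |
| PaaS: Platform-as-a-Service | The service provided is a software development platform or a business foundation platform | A bare apartment; you do the interior fitting yourself | GAE (Google App Engine) |
| SaaS: Software-as-a-Service | The service provided is applications running on cloud infrastructure | A hotel suite; you can move straight in | Google's Gmail |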

1.3. PaaS的特点(三种层次)

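| Characteristic | Description |
| --- | --- |
| Platform as a service (平台即服务) | What PaaS provides is a base platform, an environment, rather than a concrete application |
| Platform and service (平台及服务) | Besides the platform itself, technical support and optimization for the platform are also provided |
| Platform-grade service (平台级服务) | A powerful, stable platform backed by a professional support team that keeps applications running reliably |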

2. App Engine概述

2.1. App Engine概念

App Engine是PaaS模式的一种实现方式,App Engine将应用运行所需的 IT 资源和基础设施以服务的方式提供给用户,包括了中间件服务、资源管理服务、弹性调度服务、消息服务等多种服务形式。App Engine的目标是对应用提供完整生命周期(包括设计、开发、测试和部署等阶段)的支持,从而减少了用户在购置和管理应用生命周期内所必须的软硬件以及部署应用和IT 基础设施的成本,同时简化了以上工作的复杂度。常见的App Engine有:GAE(Google App Engine),SAE(Sina App Engine),BAE(Baidu App Engine)。

App Engine利用虚拟化与自动化技术实现快速搭建部署应用运行环境和动态调整应用运行时环境资源这两个目标。一方面实现即时部署以及快速回收,降低了环境搭建时间,避免了手工配置错误,快速重复搭建环境,及时回收资源, 减少了低利用率硬件资源的空置。另一方面,根据应用运行时的需求对应用环境进行动态调整,实现了应用平台的弹性扩展和自优化,减少了非高峰时硬件资源的空置。

简而言之,App Engine主要目标是:Easy to maintain(维护), Easy to scale(扩容), Easy to build(构建)

2.2. 架构设计

2.3. 组成模块说明

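| Module | Description |
| --- | --- |
| App Router [traffic ingress layer] | Receives user requests and forwards them to the appropriate App Runtime. |
| App Runtime [application runtime layer] | The application runtime environment; provides each application with the basic engine it needs to run. |
| Services [base services layer] | Common base services; mainly provides uniform access to mainstream services such as databases. |
| Platform Control [platform control layer] | The control center of the whole platform; handles business scheduling, elastic scaling, resource auditing, cluster management and related work. |
| Manage System [management UI layer] | Provides a friendly management interface through which platform administrators control and manage the whole platform. |
| Platform Support [platform support layer] | Provides supporting functions for applications, such as application monitoring, problem diagnosis, distributed log reconstruction and statistical analysis. |
| Log Center | Collects application and system logs in real time (log collection) and provides a real-time computation and analysis platform (log processing). |
| Code Center | Handles code storage, deployment and release. |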

3. 容器云平台技术栈

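| Function | Tool |
| --- | --- |
| Application carrier | Docker |
| Orchestration | Kubernetes |
| Configuration data | Etcd |
| Network management | Flannel |
| Storage management | Ceph |
| Underlying implementation | Linux kernel Namespace [resource isolation] and CGroups [resource control] |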
  • Namespace[资源隔离] Namespaces机制提供一种资源隔离方案。PID,IPC,Network等系统资源不再是全局性的,而是属于某个特定的Namespace。每个namespace下的资源对于其他namespace下的资源都是透明,不可见的。
  • CGroups[资源控制] CGroup(control group)是将任意进程进行分组化管理的Linux内核功能。CGroup本身是提供将进程进行分组化管理的功能和接口的基础结构,I/O或内存的分配控制等具体的资源管理功能是通过这个功能来实现的。CGroups可以限制、记录、隔离进程组所使用的物理资源(包括:CPU、memory、IO等),为容器实现虚拟化提供了基本保证。CGroups本质是内核附加在程序上的一系列钩子(hooks),通过程序运行时对资源的调度触发相应的钩子以达到资源追踪和限制的目的。

4. Docker概述

更多详情请参考:Docker整体架构图

4.1. Docker介绍

  1. Docker - Build, Ship, and Run Any App, Anywhere
  2. Docker是一种Linux容器工具集,它是为“构建(Build)、交付(Ship)和运行(Run)”分布式应用而设计的。
  3. Docker相当于把应用以及应用所依赖的环境完完整整地打成了一个包,这个包拿到哪里都能原生运行。因此可以在开发、测试、运维中保证环境的一致性。
  4. Docker的本质:Docker=LXC(Namespace+CGroups)+Docker Images,即在Linux内核的Namespace[资源隔离]和CGroups[资源控制]技术的基础上通过镜像管理机制来实现轻量化设计。

4.2. Docker的基本概念

4.2.1. 镜像

Docker 镜像就是一个只读的模板,可以把镜像理解成一个模子(模具),由模子(镜像)制作的成品(容器)都是一样的(除非在生成时加额外参数),修改成品(容器)本身并不会对模子(镜像)产生影响(除非将成品提交成一个模子),容器重启时,即由模子(镜像)重新制作成一个成品(容器),与其他由该模子制作成的成品并无区别。

例如:一个镜像可以包含一个完整的 ubuntu 操作系统环境,里面仅安装了 Apache 或用户需要的其它应用程序。镜像可以用来创建 Docker 容器。Docker 提供了一个很简单的机制来创建镜像或者更新现有的镜像,用户可以直接从其他人那里下载一个已经做好的镜像来直接使用。

4.2.2. 容器

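Docker uses containers to run applications. A container is a running instance created from an image. It can be created, started, stopped and deleted. Every container is an isolated, secure platform. A container can be regarded as a stripped-down Linux environment (including root privileges, process space, user space and network space) together with the application running in it.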

4.2.3. 仓库

仓库是集中存放镜像文件的场所。有时候会把仓库和仓库注册服务器(Registry)混为一谈,并不严格区分。实际上,仓库注册服务器上往往存放着多个仓库,每个仓库中又包含了多个镜像,每个镜像有不同的标签(tag)。

4.3. Docker的优势

  1. 容器的快速轻量

    容器的启动,停止和销毁都是以秒或毫秒为单位的,并且相比传统的虚拟化技术,使用容器在CPU、内存,网络IO等资源上的性能损耗都有同样水平甚至更优的表现。

  2. 一次构建,到处运行

    当将容器固化成镜像后,就可以非常快速地加载到任何环境中部署运行。而构建出来的镜像打包了应用运行所需的程序、依赖和运行环境, 这是一个完整可用的应用集装箱,在任何环境下都能保证环境一致性。

  3. 完整的生态链

    容器技术并不是Docker首创,但是以往的容器实现只关注于如何运行,而Docker站在巨人的肩膀上进行整合和创新,特别是Docker镜像的设计,完美地解决了容器从构建、交付到运行,提供了完整的生态链支持。

5. Kubernetes概述

更多详情请参考:Kubernetes总架构图

5.1. Kubernetes介绍

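Kubernetes is the container cluster management system open-sourced by Google. Built on top of Docker technology, it provides containerized applications with a complete set of features such as resource scheduling, deployment, service discovery and scaling. In essence it can be seen as a Micro-PaaS platform based on container technology, a representative project of third-generation PaaS.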

5.2. Kubernetes的基本概念

5.2.1. Pod

Pod是若干个相关容器的组合,是一个逻辑概念,Pod包含的容器运行在同一个宿主机上,这些容器使用相同的网络命名空间、IP地址和端口,相互之间能通过localhost来发现和通信,共享一块存储卷空间。在Kubernetes中创建、调度和管理的最小单位是Pod。一个Pod一般只放一个业务容器和一个用于统一网络管理的网络容器。

5.2.2. Replication Controller

Replication Controller是用来控制管理Pod副本(Replica,或者称实例),Replication Controller确保任何时候Kubernetes集群中有指定数量的Pod副本在运行,如果少于指定数量的Pod副本,Replication Controller会启动新的Pod副本,反之会杀死多余的以保证数量不变。另外Replication Controller是弹性伸缩、滚动升级的实现核心。

5.2.3. Service

Service是真实应用服务的抽象,定义了Pod的逻辑集合和访问这个Pod集合的策略,Service将代理Pod对外表现为一个单一访问接口,外部不需要了解后端Pod如何运行,这给扩展或维护带来很大的好处,提供了一套简化的服务代理和发现机制。

5.2.4. Label

Label是用于区分Pod、Service、Replication Controller的Key/Value键值对,实际上Kubernetes中的任意API对象都可以通过Label进行标识。每个API对象可以有多个Label,但是每个Label的Key只能对应一个Value。Label是Service和Replication Controller运行的基础,它们都通过Label来关联Pod,相比于强绑定模型,这是一种非常好的松耦合关系。

5.2.5. Node

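Kubernetes uses a master-slave distributed cluster architecture. A Kubernetes Node (Node for short; called Minion in early versions) runs and manages containers. As the unit Kubernetes operates on, a Node is what Pods (that is, containers) are assigned and bound to; Pods ultimately run on a Node, so a Node can be thought of as the host of its Pods.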

5.3. Kubernetes架构

3.2 - kubernetes对象

3.2.1 - Kubernetes基本概念

1. Master

集群的控制节点,负责整个集群的管理和控制,kubernetes的所有的命令基本都是发给Master,由它来负责具体的执行过程。

1.1. Master的组件

  • kube-apiserver:资源增删改查的入口
  • kube-controller-manager:资源对象的大总管
  • kube-scheduler:负责资源调度(Pod调度)
  • etcd Server:kubernetes的所有的资源对象的数据保存在etcd中。

2. Node

Node是集群的工作负载节点,默认情况kubelet会向Master注册自己,一旦Node被纳入集群管理范围,kubelet会定时向Master汇报自身的情报,包括操作系统,Docker版本,机器资源情况等。

如果Node超过指定时间不上报信息,会被Master判断为“失联”,标记为Not Ready,随后Master会触发Pod转移。

2.1. Node的组件

  • kubelet:Pod的管家,与Master通信
  • kube-proxy:实现kubernetes Service的通信与负载均衡机制的重要组件
  • Docker:容器的创建和管理

2.2. Node相关命令

kubectl get nodes

kubectl describe node {node_name}

2.3. describe命令的Node信息

  • Node基本信息:名称、标签、创建时间等
  • Node当前的状态,Node启动后会进行自检工作,磁盘是否满,内存是否不足,若都正常则切换为Ready状态。
  • Node的主机地址与主机名
  • Node上的资源总量:CPU,内存,最大可调度Pod数量等
  • Node可分配资源量:当前Node可用于分配的资源量
  • 主机系统信息:主机唯一标识符UUID,Linux kernel版本号,操作系统,kubernetes版本,kubelet与kube-proxy版本
  • 当前正在运行的Pod列表及概要信息
  • 已分配的资源使用概要,例如资源申请的最低、最大允许使用量占系统总量的百分比
  • Node相关的Event信息。

3. Pod

Pod是Kubernetes中操作的基本单元。每个Pod中有个根容器(Pause容器),Pause容器的状态代表整个容器组的状态,其他业务容器共享Pause的IP,即Pod IP,共享Pause挂载的Volume,这样简化了同个Pod中不同容器之间的网络问题和文件共享问题。

pod

  1. Kubernetes集群中,同宿主机的或不同宿主机的Pod之间要求能够TCP/IP直接通信,因此采用虚拟二层网络技术来实现,例如Flannel,Openvswitch(OVS)等,这样在同个集群中,不同的宿主机的Pod IP为不同IP段的IP,集群中的所有Pod IP都是唯一的,不同Pod之间可以直接通信。
  2. Pod有两种类型:普通Pod和静态Pod。静态Pod即不通过K8S调度和创建,直接在某个具体的Node机器上通过具体的文件来启动。普通Pod则是由K8S创建、调度,同时数据存放在ETCD中。
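  3. A Pod IP combined with a specific container port (containerPort) forms a concrete communication address, the Endpoint. A Pod can hold several containers and expose several ports on the same Pod IP, and therefore has several Endpoints.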
  4. Pod Volume是定义在Pod之上,被各个容器挂载到自己的文件系统中,可以用分布式文件系统实现后端存储功能。
  5. Pod中的Event事件可以用来排查问题,可以通过kubectl describe pod xxx 来查看对应的事件。
  6. 每个Pod可以对其能使用的服务器上的计算资源设置限额,一般为CPU和Memory。K8S中一般将千分之一个的CPU配置作为最小单位,用m表示,是一个绝对值,即100m对于一个Core的机器还是48个Core的机器都是一样的大小。Memory配额也是个绝对值,单位为内存字节数。
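  7. Resource quotas take two parameters:
    • Requests: the minimum amount of the resource requested; the system must be able to satisfy it.
    • Limits: the maximum amount of the resource the container may use. A container that exceeds its memory limit is killed and restarted, whereas CPU usage above the limit is throttled.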
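To make the units concrete, here is a small sketch that converts CPU and memory quantity strings into absolute numbers. It covers only plain numbers, the `m` suffix for CPU, and a handful of common memory suffixes; the helper names are hypothetical and this is not the parser Kubernetes itself uses.

```python
# Binary (power-of-two) and decimal suffixes commonly seen in memory quantities.
_BINARY = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4}
_DECIMAL = {"k": 10 ** 3, "M": 10 ** 6, "G": 10 ** 9, "T": 10 ** 12}


def cpu_to_millicores(quantity: str) -> int:
    """Convert '100m', '0.5' or '2' into millicores (1 core = 1000m)."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_to_bytes(quantity: str) -> int:
    """Convert '128Mi', '1G' or a plain byte count into bytes."""
    for suffix, factor in {**_BINARY, **_DECIMAL}.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)
```

Because `100m` is an absolute value, `cpu_to_millicores("100m")` gives the same result whether the node has 1 core or 48.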

pod2

4. Label

  1. Label是一个键值对,可以附加在任何对象上,比如Node,Pod,Service,RC等。Label和资源对象是多对多的关系,即一个Label可以被添加到多个对象上,一个对象也可以定义多个Label。
  2. Label的作用主要用来实现精细的、多维度的资源分组管理,以便进行资源分配,调度,配置,部署等工作。
  3. Label通俗理解就是“标签”,通过标签来过滤筛选指定的对象,进行具体的操作。k8s通过Label Selector(标签选择器)来筛选指定Label的资源对象,类似SQL语句中的条件查询(WHERE语句)。
  4. Label Selector有基于等式和基于集合的两种表达方式,可以多个条件进行组合使用。
  • 基于等式:name=redis-slave(匹配name=redis-slave的资源对象);env!=product(匹配所有不具有标签env=product的资源对象)
  • 基于集合:name in (redis-slave,redis-master);name not in (php-frontend)(匹配所有不具有标签name=php-frontend的资源对象)
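The two selector styles can be modelled with plain dictionaries. This is an illustrative sketch of the matching semantics described above, not the Kubernetes implementation; the function `matches` and its parameter names are hypothetical.

```python
def matches(labels, equals=None, not_equals=None, in_set=None, not_in_set=None):
    """Return True if `labels` satisfies every given selector condition.

    equals / not_equals model equality-based selectors (key=value, key!=value);
    in_set / not_in_set model set-based selectors (key in (...), key notin (...)).
    All conditions are combined with logical AND.
    """
    for key, value in (equals or {}).items():
        if labels.get(key) != value:
            return False
    for key, value in (not_equals or {}).items():
        # An object without the key at all also satisfies key!=value.
        if labels.get(key) == value:
            return False
    for key, values in (in_set or {}).items():
        if key not in labels or labels[key] not in values:
            return False
    for key, values in (not_in_set or {}).items():
        if labels.get(key) in values:
            return False
    return True


pod_labels = {"name": "redis-slave", "env": "test"}
# Equivalent to the selector: name=redis-slave,env!=product
selected = matches(pod_labels, equals={"name": "redis-slave"},
                   not_equals={"env": "product"})
```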

使用场景

  1. kube-controller进程通过资源对象RC上定义的Label Selector来筛选要监控的Pod副本数,从而实现副本数始终保持预期数目。
  2. kube-proxy进程通过Service的Label Selector来选择对应Pod,自动建立每个Service到对应Pod的请求转发路由表,从而实现Service的智能负载均衡机制。
  3. kube-scheduler实现Pod定向调度:对Node定义特定的Label,并且在Pod定义文件中使用NodeSelector标签调度策略。

5. Replication Controller(RC)

RC是k8s系统中的核心概念,定义了一个期望的场景。

主要包括:

  • Pod期望的副本数(replicas)
  • 用于筛选目标Pod的Label Selector
  • 用于创建Pod的模板(template)

RC特性说明:

  1. Pod的缩放可以通过以下命令实现:kubectl scale rc redis-slave --replicas=3
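  2. Deleting an RC does not by itself remove the Pods it created; setting the replica count to 0 removes them. Alternatively, kubectl delete removes the RC together with its Pods in one step (older releases also offered kubectl stop for this, which later releases dropped).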
  3. 改变RC中Pod模板的镜像版本可以实现滚动升级(Rolling Update)。具体操作见https://kubernetes.io/docs/tasks/run-application/rolling-update-replication-controller/
  4. Kubernetes1.2以上版本将RC升级为Replica Set,它与当前RC的唯一区别在于Replica Set支持基于集合的Label Selector(Set-based selector),而旧版本RC只支持基于等式的Label Selector(equality-based selector)。
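  5. From Kubernetes 1.2 onward, a Replica Set is maintained through a Deployment rather than used on its own. The control flow is Deployment → Replica Set → Pod, so in newer versions Deployment plus Replica Set take over the role of the RC.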

6. Deployment

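Deployment is a concept introduced in Kubernetes 1.2 to solve the problem of Pod orchestration. It can be seen as an upgraded RC (RC + Replica Set). Its distinguishing feature is that the rollout progress of the Pods is visible at any time, covering the whole sequence of Pod creation, scheduling, node binding and container start-up.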

使用场景

  1. 创建一个Deployment对象来生成对应的Replica Set并完成Pod副本的创建过程。
  2. 检查Deployment的状态来确认部署动作是否完成(Pod副本的数量是否达到预期值)。
  3. 更新Deployment以创建新的Pod(例如镜像升级的场景)。
  4. 如果当前Deployment不稳定,回退到上一个Deployment版本。
  5. 挂起或恢复一个Deployment。

可以通过kubectl describe deployment来查看Deployment控制的Pod的水平拓展过程。

7. Horizontal Pod Autoscaler(HPA)

Horizontal Pod Autoscaler(HPA)即Pod横向自动扩容,与RC一样也属于k8s的资源对象。

HPA原理:通过追踪分析RC控制的所有目标Pod的负载变化情况,来确定是否针对性调整Pod的副本数。

Pod负载度量指标:

  • CPUUtilizationPercentage:Pod所有副本自身的CPU利用率的平均值。即当前Pod的CPU使用量除以Pod Request的值。
  • 应用自定义的度量指标,比如服务每秒内响应的请求数(TPS/QPS)。
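The CPU metric and the resulting scaling decision can be sketched as follows. The utilization definition comes from the text above; the replica formula, ceil(currentReplicas × currentUtilization / targetUtilization), is the one given in the upstream HPA documentation and is an addition to these notes. The real controller also applies a tolerance and stabilization behavior that this sketch leaves out, and the function names are hypothetical.

```python
import math


def cpu_utilization_percent(usages_millicores, request_millicores):
    """Average CPU utilization across replicas: usage / request, in percent."""
    ratios = [usage / request_millicores * 100 for usage in usages_millicores]
    return sum(ratios) / len(ratios)


def desired_replicas(current_replicas, current_utilization, target_utilization):
    """Replica count that would bring average utilization back to the target."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


# Two replicas requesting 500m each and using 400m and 600m: 100% average utilization.
utilization = cpu_utilization_percent([400, 600], 500)
# With a 50% target, the replica count doubles.
replicas = desired_replicas(2, utilization, 50)
```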

8. Service(服务)

8.1. Service概述

service

Service定义了一个服务的访问入口地址,前端应用通过这个入口地址访问其背后的一组由Pod副本组成的集群实例,Service与其后端的Pod副本集群之间是通过Label Selector来实现“无缝对接”。RC保证Service的Pod副本实例数目保持预期水平。

8.2. kubernetes的服务发现机制

主要通过kube-dns这个组件来进行DNS方式的服务发现。

8.3. 外部系统访问Service的问题

IP类型 说明
Node IP Node节点的IP地址
Pod IP Pod的IP地址
Cluster IP Service的IP地址

8.3.1. Node IP

NodeIP是集群中每个节点的物理网卡IP地址,是真实存在的物理网络,kubernetes集群之外的节点访问kubernetes内的某个节点或TCP/IP服务的时候,需要通过NodeIP进行通信。

8.3.2. Pod IP

Pod IP是每个Pod的IP地址,是Docker Engine根据docker0网桥的IP段地址进行分配的,是一个虚拟二层网络,集群中一个Pod的容器访问另一个Pod中的容器,是通过Pod IP进行通信的,而真实的TCP/IP流量是通过Node IP所在的网卡流出的。

8.3.3. Cluster IP

  1. Service的Cluster IP是一个虚拟IP,只作用于Service这个对象,由kubernetes管理和分配IP地址(来源于Cluster IP地址池)。
  2. Cluster IP无法被ping通,因为没有一个实体网络对象来响应。
  3. Cluster IP结合Service Port组成的具体通信端口才具备TCP/IP通信基础,属于kubernetes集群内,集群外访问该IP和端口需要额外处理。
  4. k8s集群内Node IP 、Pod IP、Cluster IP之间的通信采取k8s自己的特殊的路由规则,与传统IP路由不同。

8.3.4. 外部访问Kubernetes集群

Access is provided by mapping a host port to the container port, for example through a Service definition file such as the following:

可以通过任意Node的IP 加端口访问该服务。也可以通过Nginx或HAProxy来设置负载均衡。

9. Volume(存储卷)

9.1. Volume的功能

  1. Volume是Pod中能够被多个容器访问的共享目录,可以让容器的数据写到宿主机上或者写文件到网络存储中
  2. 可以实现容器配置文件集中化定义与管理,通过ConfigMap资源对象来实现。

9.2. Volume的特点

k8s中的Volume与Docker的Volume相似,但不完全相同。

  1. k8s上Volume定义在Pod上,然后被一个Pod中的多个容器挂载到具体的文件目录下。
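  2. A Kubernetes Volume is tied to the lifecycle of the Pod, not to that of a container: if a container dies the data survives, but if the Pod dies the data is lost.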
  3. k8s中的Volume支持多种类型的Volume:Ceph、GlusterFS等分布式系统。

9.3. Volume的使用方式

先在Pod上声明一个Volume,然后容器引用该Volume并Mount到容器的某个目录。

9.4. Volume类型

9.4.1. emptyDir

emptyDir Volume是在Pod分配到Node时创建的,初始内容为空,无须指定宿主机上对应的目录文件,由K8S自动分配一个目录,当Pod被删除时,对应的emptyDir数据也会永久删除。

作用

  1. 临时空间,例如程序的临时文件,无须永久保留
  2. 长时间任务的中间过程CheckPoint的临时保存目录
  3. 一个容器需要从另一个容器中获取数据的目录(即多容器共享目录)

说明

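In the Kubernetes version these notes describe, users cannot choose the storage medium behind an emptyDir; if the kubelet is configured to use a disk, the emptyDir is created on that disk.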

9.4.2. hostPath

hostPath是在Pod上挂载宿主机上的文件或目录。

作用

  1. 容器应用日志需要持久化时,可以使用宿主机的高速文件系统进行存储
  2. 需要访问宿主机上Docker引擎内部数据结构的容器应用时,可以通过定义hostPath为宿主机/var/lib/docker目录,使容器内部应用可以直接访问Docker的文件系统。

注意点:

  1. 在不同的Node上具有相同配置的Pod可能会因为宿主机上的目录或文件不同导致对Volume上目录或文件的访问结果不一致。
  2. 如果使用了资源配额管理,则kubernetes无法将hostPath在宿主机上使用的资源纳入管理。

9.4.3. gcePersistentDisk

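This type stores the Volume's data on a Persistent Disk (PD) provided by Google's public cloud. Unlike emptyDir, the contents of a PD are kept permanently: when the Pod is deleted, the PD is only unmounted, not deleted. A persistent disk has to be created first before gcePersistentDisk can be used.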

使用gcePersistentDisk的限制条件:

  • Node(运行kubelet的节点)需要是GCE虚拟机。
  • 虚拟机需要与PD存在于相同的GCE项目中和Zone中。

10. Persistent Volume

Volume定义在Pod上,属于“计算资源”的一部分,而Persistent Volume和Persistent Volume Claim是网络存储,简称PV和PVC,可以理解为k8s集群中某个网络存储中对应的一块存储。

  • PV是网络存储,不属于任何Node,但可以在每个Node上访问。
  • PV不是定义在Pod上,而是独立于Pod之外定义。
  • PV常见类型:GCE Persistent Disks、NFS、RBD等。

PV是有状态的对象,状态类型如下:

  • Available:空闲状态
  • Bound:已经绑定到某个PVC上
  • Released:对应的PVC已经删除,但资源还没有回收
  • Failed:PV自动回收失败

11. Namespace

Namespace即命名空间,主要用于多租户的资源隔离,通过将资源对象分配到不同的Namespace上,便于不同的分组在共享资源的同时可以被分别管理。

After the cluster starts, Kubernetes creates a Namespace named "default". It can be listed with kubectl get namespaces.

可以通过kubectl config use-context namespace配置当前k8s客户端的环境,通过kubectl get pods获取当前namespace的Pod。或者通过kubectl get pods --namespace=NAMESPACE来获取指定namespace的Pod。

Namespace yaml文件的定义

12. Annotation(注解)

Annotation与Label类似,也使用key/value的形式进行定义,Label定义元数据(Metadata),Annotation定义“附加”信息。

通常Annotation记录信息如下:

  • build信息,release信息,Docker镜像信息等。
  • 日志库、监控库等。

参考《Kubernetes权威指南》

3.2.2 - 理解kubernetes对象

1. kubernetes对象概述

kubernetes中的对象是一些持久化的实体,可以理解为是对集群状态的描述或期望

包括:

  • 集群中哪些node上运行了哪些容器化应用
  • 应用的资源是否满足使用
  • 应用的执行策略,例如重启策略、更新策略、容错策略等。

kubernetes的对象是一种意图(期望)的记录,kubernetes会始终保持预期创建的对象存在和集群运行在预期的状态下

操作kubernetes对象(增删改查)需要通过kubernetes API,一般有以下几种方式:

  • kubectl命令工具
  • Client library的方式,例如 client-go

2. Spec and Status

The structure of every Kubernetes object contains two parts: spec and status.

  • spec:该内容由用户提供,描述用户期望的对象特征及集群状态。
  • status:该内容由kubernetes集群提供和更新,描述kubernetes对象的实时状态。

任何时候,kubernetes都会控制集群的实时状态status与用户的预期状态spec一致。

例如:当你定义Deployment的描述文件,指定集群中运行3个实例,那么kubernetes会始终保持集群中运行3个实例,如果任何实例挂掉,kubernetes会自动重建新的实例来保持集群中始终运行用户预期的3个实例。
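The relationship between spec and status can be pictured as a reconciliation step that compares the two and emits corrective actions. The sketch below is a deliberately simplified model with hypothetical names; real controllers watch the API server and act on much richer state.

```python
def reconcile(spec_replicas: int, status_replicas: int) -> list:
    """Return the actions needed to move the observed state to the desired one."""
    diff = spec_replicas - status_replicas
    if diff > 0:
        return ["create"] * diff      # too few instances: start replacements
    if diff < 0:
        return ["delete"] * (-diff)   # too many instances: remove the surplus
    return []                         # status already matches spec


# The user asked for 3 instances (spec); one crashed, so 2 are observed (status).
actions = reconcile(spec_replicas=3, status_replicas=2)
```

Running such a step repeatedly is what keeps the cluster converging on the declared state rather than on a one-off command.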

3. 对象描述文件

当你要创建一个kubernetes对象的时候,需要提供该对象的描述信息spec,来描述你的对象在kubernetes中的预期状态。

一般使用kubernetes API来创建kubernetes对象,其中spec信息可以以JSON的形式存放在request body中,也可以以.yaml文件的形式通过kubectl工具创建。

例如,以下为Deployment对象对应的yaml文件:

apiVersion: apps/v1 # apps/v1beta1 and apps/v1beta2 were removed in Kubernetes 1.16
kind: Deployment
metadata:
  name: nginx-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:1.7.9
        ports:
        - containerPort: 80

执行kubectl create的命令

#create command
kubectl create -f https://k8s.io/docs/user-guide/nginx-deployment.yaml --record
#output
deployment "nginx-deployment" created

4. 必须字段

在对象描述文件.yaml中,必须包含以下字段。

  • apiVersion:kubernetes API的版本
  • kind:kubernetes对象的类型
  • metadata:唯一标识该对象的元数据,包括name,UID,可选的namespace
  • spec:标识对象的详细信息,不同对象的spec的格式不同,可以嵌套其他对象的字段。

文章参考:

https://kubernetes.io/docs/concepts/overview/working-with-objects/kubernetes-objects/

3.3 - Pod对象

3.3.1 - Pod介绍

1. Pod是什么(what)

1.1. Pod概念

  • Pod是kubernetes集群中最小的部署和管理的基本单元,协同寻址,协同调度。
  • Pod是一个或多个容器的集合,是一个或一组服务(进程)的抽象集合。
  • Pod中可以共享网络和存储(可以简单理解为一个逻辑上的虚拟机,但并不是虚拟机)。
  • Pod被创建后用一个UID来唯一标识,当Pod生命周期结束,被一个等价Pod替代,UID将重新生成。

1.1.1. Pod与Docker

  • Docker是目前Pod最常用的容器环境,但仍支持其他容器环境。
  • Pod是一组被模块化的拥有共享命名空间和共享存储卷的容器,但并没有共享PID 命名空间(即同个Pod的不同容器中进程的PID是独立的,互相看不到非自己容器的进程)。

1.1.2. Pod中容器的运行方式

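  1. Run a single container

The one-container-per-Pod model is the most common one; such a Pod can be managed as if it were a single container.

  2. Run multiple tightly coupled containers

The sidecar model: the Pod encapsulates a group of tightly coupled containers that share resources and are addressed together, and treats the group as one unit of management.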

1.2. Pod管理多个容器

Pod是一组紧耦合的容器的集合,Pod内的容器作为一个整体以Pod形式进行协同寻址,协同调度、协同管理。相同Pod内的容器共享网络和存储。

1.2.1. 网络

  • 每个Pod被分配了唯一的IP地址,该Pod内的所有容器共享一个网络空间,包括IP和端口。
  • 同个Pod不同容器之间通过localhost通信,Pod内端口不能冲突。
  • 不同Pod之间的通信则通过IP+端口的形式来访问到Pod内的具体服务(容器)。

1.2.2. 存储

  • 可以在Pod中创建共享存储卷的方式来实现不同容器之间数据共享。

2. 为什么需要Pod(why)

2.1. 管理需求

Pod 是一种模式的抽象:互相协作的多个进程(容器)共同形成一个完整的服务。以一个或多个容器的方式组合成一个整体,作为管理的基本单元,通过Pod可以方便部署、水平扩展,协同调度等。

2.2. 资源共享和通信

Pod作为多个紧耦合的容器的集合,通过共享网络和存储的方式来简化紧耦合容器之间的通信,从这个角度,可以将Pod简单理解为一个逻辑上的“虚拟机”。而不同的Pod之间的通信则通过Pod的IP和端口的方式。

2.3. Pod设计的优势

  • 调度器和控制器的可拔插性。
  • 将Pod 的生存期从 controller 中剥离出来,从而减少相互影响。
  • 高可用--在终止和删除 Pod 前,需要提前生成替代 Pod。
  • 集群级别的功能和 Kubelet(Pod Controller) 级别的功能组合更加清晰。

3. Pod的使用(how)

Pod一般是通过各种不同类型的Controller对Pod进行管理和控制,包括自我恢复(例如Pod因异常退出,则会再起一个相同的Pod替代该Pod,而该Pod则会被清除)。也可以不通过Controller单独创建一个Pod,但一般很少这么操作,因为这个Pod是一个孤立的实体,并不会被Controller管理。

3.1. Controller

Controller是kubernetes中用于对Pod进行管理的控制器,通过该控制器让Pod始终维持在一个用户原本设定或期望的状态。如果节点宕机或者Pod因其他原因死亡,则会在其他节点起一个相同的Pod来替代该Pod。

常用的Controller有:

  • Deployment
  • StatefulSet
  • DaemonSet

Controller是通过用户提供的Pod模板来创建和控制Pod。

3.2. Pod模板

Pod模板用来定义Pod的各种属性,Controller通过Pod模板来生成对应的Pod。

Pod模板类似一个饼干模具,通过模具已经生成的饼干与原模具已经没有关系,即对原模具的修改不会影响已经生成的饼干,只会对通过修改后的模具生成的饼干有影响。这种方式可以更加方便地控制和管理Pod。

4. Pod的终止

用户发起一个删除Pod的请求,系统会先发送TERM信号给每个容器的主进程,如果在宽限期(默认30秒)主进程没有自主终止运行,则系统会发送KILL信号给该进程,接着Pod将被删除。

4.1. Pod终止的流程

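  1. The user sends a command to delete the Pod, using the default grace period (30s).
  2. The Pod object in the API server is updated with the time beyond which the Pod is considered "dead", together with the grace period.
  3. The Pod is shown as Terminating when listed with client commands.
  4. (Simultaneous with step 3) When the kubelet sees that the Pod has been marked as terminating because the time in step 2 has been set, it begins the Pod shutdown process.
    1. If the Pod defines a preStop hook, the hook is invoked inside the Pod. If the preStop hook is still running after the grace period expires, step 2 is invoked again with a small extended grace period (2 seconds).
    2. The processes in the Pod are sent the TERM signal.
  5. (Simultaneous with step 3) The Pod is removed from the endpoints list of its Service and is no longer counted as part of the set of running Pods by replication controllers. Pods that shut down slowly can continue to serve traffic until load balancers (such as the service proxy) remove them from rotation.
  6. When the grace period expires, any processes still running in the Pod are killed with SIGKILL.
  7. The kubelet finishes deleting the Pod on the API server by setting the grace period to 0 (immediate deletion). The Pod then disappears from the API and is no longer visible to clients.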
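The timing in the steps above can be captured in a small model. This sketch assumes that the preStop hook runs before TERM is sent and that its duration counts against the same grace period; it ignores the 2-second extension mentioned in step 4. The function and its parameters are hypothetical and only illustrate the decision between a clean exit and SIGKILL.

```python
def termination_outcome(grace_period_s: float, pre_stop_s: float, shutdown_s: float) -> str:
    """Predict how a container ends, given how long its hook and shutdown take.

    grace_period_s: the deletion grace period (30 seconds by default)
    pre_stop_s:     time the preStop hook needs (0 if no hook is defined)
    shutdown_s:     time the main process needs to exit after receiving TERM
    """
    if pre_stop_s + shutdown_s <= grace_period_s:
        return "exited after TERM"
    return "killed by SIGKILL"


# A 5 s hook plus a 10 s shutdown fits within the default 30 s grace period.
outcome = termination_outcome(30, pre_stop_s=5, shutdown_s=10)
```

Applications with long shutdown paths therefore need either a faster TERM handler or a longer grace period.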

4.2. 强制删除Pod

强制删除Pod是指从k8s集群状态和Etcd中立刻删除对应的Pod数据,API Server不会等待kubelet的确认信息。被强制删除后,即可重新创建一个相同名字的Pod。

删除默认的宽限期是30秒,通过将宽限期设置为0的方式可以强制删除Pod。

Append the --force and --grace-period=0 flags to the kubectl delete command to force-delete a Pod:

kubectl delete pod <pod_name> --namespace=<namespace>  --force --grace-period=0

4.3. Pod特权模式

特权模式是指让Pod中的进程具有访问宿主机系统设备或使用网络栈操作等的能力,例如编写网络插件和卷插件。

通过将container spec中的SecurityContext设置为privileged即将该容器赋予了特权模式。特权模式的使用要求k8s版本高于v1.1


3.3.2 - Pod定义文件

1. Pod的基本用法

1.1. 说明

  1. Pod实际上是容器的集合,在k8s中对运行容器的要求为:容器的主程序需要一直在前台运行,而不是后台运行。应用可以改造成前台运行的方式,例如Go语言的程序,直接运行二进制文件;java语言则运行主类;tomcat程序可以写个运行脚本。或者通过supervisor的进程管理工具,即supervisor在前台运行,应用程序由supervisor管理在后台运行。具体可参考supervisord
  2. 当多个应用之间是紧耦合的关系时,可以将多个应用一起放在一个Pod中,同个Pod中的多个容器之间互相访问可以通过localhost来通信(可以把Pod理解成一个虚拟机,共享网络和存储卷)。

1.2. Pod相关命令

| Operation | Command | Notes |
|---|---|---|
| Create | kubectl create -f frontend-localredis-pod.yaml | |
| Query Pod status | kubectl get pods --namespace=<NAMESPACE> | |
| Query Pod details | kubectl describe pod <POD_NAME> --namespace=<NAMESPACE> | Commonly used for troubleshooting; shows Events |
| Delete | kubectl delete pod <POD_NAME>; kubectl delete pod --all | |
| Update | kubectl replace -f pod.yaml | |

2. Pod的定义文件

apiVersion: v1
kind: Pod
metadata:
  name: string
  namespace: string
  labels:
    name: string
  annotations:
    name: string
spec:
  containers:
  - name: string
    image: string
    imagePullPolicy: [Always | Never | IfNotPresent]
    command: [string]
    args: [string]
    workingDir: string
    volumeMounts:
    - name: string
      mountPath: string
      readOnly: boolean
    ports:
    - name: string
      containerPort: int
      hostPort: int
      protocol: string
    env:
    - name: string
      value: string
    resources:
      limits:
        cpu: string
        memory: string
      requests:
        cpu: string
        memory: string
    livenessProbe:
      exec:
        command: [string]
      httpGet:
        path: string
        port: int
        host: string
        scheme: string
        httpHeaders:
        - name: string
          value: string
      tcpSocket:
        port: int
      initialDelaySeconds: number
      timeoutSeconds: number
      periodSeconds: number
      successThreshold: 0
      failureThreshold: 0
    securityContext:
      privileged: false
  restartPolicy: [Always | Never | OnFailure]
  nodeSelector: object
  imagePullSecrets:
  - name: string
  hostNetwork: false
  volumes:
  - name: string
    emptyDir: {}
    hostPath:
      path: string
    secret:
      secretName: string
      items:
      - key: string
        path: string
    configMap:
      name: string
      items:
      - key: string
        path: string

3. 静态pod

静态Pod是由kubelet进行管理,仅存在于特定Node上的Pod。它们不能通过API Server进行管理,无法与ReplicationController、Deployment或DaemonSet进行关联,并且kubelet也无法对其健康检查。

静态Pod总是由kubelet创建,并且总在kubelet所在的Node上运行。

创建静态Pod的方式:

3.1. 通过配置文件方式

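Set the kubelet startup flag --config to the directory containing the manifest files that the kubelet should watch. The kubelet scans this directory periodically and creates Pods from the .yaml or .json files in it. A static Pod cannot be deleted through the API Server (an attempted deletion leaves it in the Pending state); to delete it, remove its yaml or json file from the directory. In later kubelet releases the equivalent flag is named --pod-manifest-path.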

例如:

配置目录为/etc/kubelet.d/,配置启动参数:--config=/etc/kubelet.d/,该目录下放入static-web.yaml。

apiVersion: v1
kind: Pod
metadata:
  name: static-web
  labels:
    name: static-web
spec:
  containers:
  - name: static-web
    image: nginx
    ports:
    - name: web
      containerPort: 80

参考文章

  • 《Kubernetes权威指南》

3.3.3 - Pod生命周期

1. Pod phase

Pod的phase是Pod生命周期中的简单宏观描述,定义在Pod的PodStatus对象的phase 字段中。

phase有以下几种值:

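| Phase | Description |
|---|---|
| Pending | The Pod has been accepted by the Kubernetes system, but one or more of its containers have not yet been created. The waiting time includes the time spent scheduling the Pod and downloading images over the network. |
| Running | The Pod has been bound to a node and all of its containers have been created. At least one container is running, or is in the process of starting or restarting. |
| Succeeded | All containers in the Pod have terminated successfully and will not be restarted. |
| Failed | All containers in the Pod have terminated, and at least one container terminated in failure; that is, the container exited with a non-zero status or was terminated by the system. |
| Unknown | The state of the Pod could not be obtained for some reason, typically because of a communication failure with the Pod's host. |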

2. Pod 状态

A Pod has a PodStatus object, which contains an array of PodCondition entries. A PodCondition has the following fields:

  • lastProbeTime:Pod condition最后一次被探测到的时间戳。
  • lastTransitionTime:Pod最后一次状态转变的时间戳。
  • message:状态转化的信息,一般为报错信息,例如:containers with unready status: [c-1]。
  • reason:最后一次状态形成的原因,一般为报错原因,例如:ContainersNotReady。
  • status:包含的值有 True、False 和 Unknown。
  • type:Pod状态的几种类型。

其中type字段包含以下几个值:

  • PodScheduled:Pod已经被调度到运行节点。
  • Ready:Pod已经可以接收请求提供服务。
  • Initialized: all init containers have completed successfully.
  • Unschedulable:无法调度该Pod,例如节点资源不够。
  • ContainersReady:Pod中的所有容器已准备就绪。

3. 重启策略

Pod通过restartPolicy字段指定重启策略,重启策略类型为:Always、OnFailure 和 Never,默认为 Always。

restartPolicy 仅指通过同一节点上的 kubelet 重新启动容器。

重启策略 说明
Always 当容器失效时,由kubelet自动重启该容器
OnFailure 当容器终止运行且退出码不为0时,由kubelet自动重启该容器
Never 不论容器运行状态如何,kubelet都不会重启该容器

说明

可以管理Pod的控制器有Replication Controller,Job,DaemonSet,及kubelet(静态Pod)。

  1. RC和DaemonSet:必须设置为Always,需要保证该容器持续运行。
  2. Job:OnFailure或Never,确保容器执行完后不再重启。
  3. kubelet:在Pod失效的时候重启它,不论RestartPolicy设置为什么值,并且不会对Pod进行健康检查。

4. Pod的生命

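A Pod's lifecycle is usually managed through a Controller. Every kind of Controller contains a PodTemplate that specifies the Pod's properties, and the Controller automatically reschedules and recovers Pods that enter an abnormal state. Unless the managed Pods are deleted through the Controller, Kubernetes keeps Pods running in the state the user expects.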

控制器的分类

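  • Use a Job for Pods that are expected to terminate, such as batch computations. Jobs are appropriate only for Pods with a restartPolicy of OnFailure or Never.
  • Use a ReplicationController, ReplicaSet, or Deployment for Pods that are not expected to terminate, such as web servers. A ReplicationController is appropriate only for Pods with a restartPolicy of Always.
  • Use a DaemonSet for Pods that provide machine-specific system services, running one Pod per machine.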

If a node dies or is disconnected from the rest of the cluster, Kubernetes applies a policy that sets the phase of all Pods on the lost node to Failed.

5. Pod状态转换

常见的状态转换

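| Containers in Pod | Current phase | Event | Result (RestartPolicy=Always) | Result (RestartPolicy=OnFailure) | Result (RestartPolicy=Never) |
|---|---|---|---|---|---|
| One container | Running | Container exits successfully | Running | Succeeded | Succeeded |
| One container | Running | Container exits with failure | Running | Running | Failed |
| Two containers | Running | One container exits with failure | Running | Running | Running |
| Two containers | Running | Container is killed by OOM | Running | Running | Failed |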
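
The single-container rows of the table can be written as a small lookup function. This is a sketch of the table only, not of the kubelet's implementation; the function name next_phase is an assumption made for the example.

```python
def next_phase(restart_policy: str, exit_code: int) -> str:
    """Resulting phase of a single-container Pod after its container exits.

    Encodes only the one-container rows of the transition table
    (hypothetical helper for illustration).
    """
    if restart_policy == "Always":
        # The kubelet restarts the container regardless of the exit code.
        return "Running"
    if restart_policy == "OnFailure":
        # Only failed containers are restarted.
        return "Succeeded" if exit_code == 0 else "Running"
    if restart_policy == "Never":
        # Nothing is restarted; the exit code decides the final phase.
        return "Succeeded" if exit_code == 0 else "Failed"
    raise ValueError(f"unknown restartPolicy: {restart_policy}")
```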

5.1. 容器运行时内存超出限制

  • 容器以失败状态终止。
  • 记录 OOM 事件。
  • 如果restartPolicy为:
    • Always:重启容器;Pod phase 仍为 Running。
    • OnFailure:重启容器;Pod phase 仍为 Running。
    • Never: log a failure event; the Pod phase becomes Failed.

5.2. 磁盘故障

  • 杀掉所有容器。
  • 记录适当事件。
  • Pod phase 变成 Failed。
  • 如果使用控制器来运行,Pod 将在别处重建。

5.3. 运行节点挂掉

  • 节点控制器等待直到超时。
  • 节点控制器将 Pod phase 设置为 Failed。
  • 如果是用控制器来运行,Pod 将在别处重建。


3.3.4 - Pod健康检查

Pod健康检查

Pod health is checked by two kinds of probes: LivenessProbe and ReadinessProbe.

1. 探针类型

1. livenessProbe(存活探针)

  • 表明容器是否正在运行。
  • 如果存活探测失败,则 kubelet 会杀死容器,并且容器将受到其 重启策略的影响。
  • 如果容器不提供存活探针,则默认状态为 Success

2. readinessProbe(就绪探针)

  • 表明容器是否可以正常接受请求。
  • 如果就绪探测失败,端点控制器将从与 Pod 匹配的所有 Service 的端点中删除该 Pod 的 IP 地址。
  • 初始延迟之前的就绪状态默认为 Failure
  • 如果容器不提供就绪探针,则默认状态为 Success

2. Handler

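A probe is a diagnostic that the kubelet performs periodically on a container. To run it, the kubelet calls one of three kinds of Handler configured for the container: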

Handler的类型

  • ExecAction:在容器内执行指定命令。如果命令退出时返回码为 0 则认为诊断成功。
  • TCPSocketAction:对指定端口上的容器的 IP 地址进行 TCP 检查。如果端口打开,则诊断被认为是成功的。
  • HTTPGetAction:对指定的端口和路径上的容器的 IP 地址执行 HTTP Get 请求。如果响应的状态码大于等于200 且小于 400,则诊断被认为是成功的。

探测结果为以下三种之一:

  • 成功:容器通过了诊断。
  • 失败:容器未通过诊断。
  • 未知:诊断失败,因此不会采取任何行动。
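
The success criteria of the three handler types can be sketched as plain functions. These are illustrative stand-ins, not the kubelet's code; the function names and the one-second default timeout are assumptions made for the example.

```python
import socket


def exec_probe_ok(exit_code: int) -> bool:
    """ExecAction: the command must exit with status 0."""
    return exit_code == 0


def http_probe_ok(status_code: int) -> bool:
    """HTTPGetAction: status must be >= 200 and < 400."""
    return 200 <= status_code < 400


def tcp_probe_ok(host: str, port: int, timeout_seconds: float = 1.0) -> bool:
    """TCPSocketAction: succeeds if a TCP connection can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Connection refused, timed out, unreachable, and so on.
        return False
```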

3. 探针使用方式

  • 如果容器异常可以自动崩溃,则不一定要使用探针,可以由Pod的restartPolicy执行重启操作。
  • 存活探针适用于希望容器探测失败后被杀死并重新启动,需要指定restartPolicy 为 Always 或 OnFailure。
  • 就绪探针适用于希望Pod在不能正常接收流量的时候被剔除,并且在就绪探针探测成功后才接收流量。

存活探针由 kubelet 来执行,因此所有的请求都在 kubelet 的网络命名空间中进行。

3.1. LivenessProbe参数

  • initialDelaySeconds:启动容器后首次进行健康检查的等待时间,单位为秒。
  • timeoutSeconds:健康检查发送请求后等待响应的时间,如果超时响应kubelet则认为容器非健康,重启该容器,单位为秒。

3.2. LivenessProbe三种实现方式

1)ExecAction:在一个容器内部执行一个命令,如果该命令状态返回值为0,则表明容器健康。

apiVersion: v1
kind: Pod
metadata:
  name: liveness-exec
spec:
  containers:
  - name: liveness
    image: gcr.io/google_containers/busybox
    args:
    - /bin/sh
    - -c
    - echo ok > /tmp/health;sleep 10;rm -fr /tmp/health;sleep 600
    livenessProbe:
      exec:
        command:
        - cat
        - /tmp/health
      initialDelaySeconds: 15
      timeoutSeconds: 1

2)TCPSocketAction:通过容器IP地址和端口号执行TCP检查,如果能够建立TCP连接,则表明容器健康。

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-healthcheck
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    livenessProbe:
      tcpSocket:
        port: 80
      initialDelaySeconds: 15
      timeoutSeconds: 1

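3) HTTPGetAction: performs an HTTP GET request against the container's IP address, port, and path. The container is considered healthy if the response status code is greater than or equal to 200 and less than 400.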

apiVersion: v1
kind: Pod
metadata:
  name: pod-with-healthcheck
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
    livenessProbe:
      httpGet:
        path: /_status/healthz
        port: 80
      initialDelaySeconds: 15
      timeoutSeconds: 1


3.3.5 - Pod存储卷

Pod Volume

同一个Pod中的多个容器可以共享Pod级别的存储卷Volume,Volume可以定义为各种类型,多个容器各自进行挂载,将Pod的Volume挂载为容器内部需要的目录。

例如:Pod级别的Volume:"app-logs",用于tomcat向其中写日志文件,busybox读日志文件。


pod-volumes-applogs.yaml

apiVersion: v1
kind: Pod
metadata:
  name: volume-pod
spec:
  containers:
  - name: tomcat
    image: tomcat
    ports:
    - containerPort: 8080
    volumeMounts:
    - name: app-logs
      mountPath: /usr/local/tomcat/logs
  - name: busybox
    image: busybox
    command: ["sh","-c","tail -f /logs/catalina*.log"]
    volumeMounts:
    - name: app-logs
      mountPath: /logs
  volumes:
  - name: app-logs
    emptyDir: {}

查看日志

  1. kubectl logs <pod_name> -c <container_name>
  2. kubectl exec -it <pod_name> -c <container_name> -- tail /usr/local/tomcat/logs/catalina.xx.log

参考文章

  • 《Kubernetes权威指南》

3.3.6 - Pod调度

Pod调度

在kubernetes集群中,Pod(container)是应用的载体,一般通过RC、Deployment、DaemonSet、Job等对象来完成Pod的调度与自愈功能。

1. RC、Deployment:全自动调度

RC的功能即保持集群中始终运行着指定个数的Pod。

在调度策略上主要有:

  • 系统内置调度算法[最优Node]
  • NodeSelector[定向调度]
  • NodeAffinity[亲和性调度]

2. NodeSelector[定向调度]

k8s中kube-scheduler负责实现Pod的调度,内部系统通过一系列算法最终计算出最佳的目标节点。如果需要将Pod调度到指定Node上,则可以通过Node的标签(Label)和Pod的nodeSelector属性相匹配来达到目的。

1、kubectl label nodes {node-name} {label-key}={label-value}

2、nodeSelector: {label-key}:{label-value}

如果给多个Node打了相同的标签,则scheduler会根据调度算法从这组Node中选择一个可用的Node来调度。

如果Pod的nodeSelector的标签在Node中没有对应的标签,则该Pod无法被调度成功。
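
The exact-match rule behind nodeSelector can be sketched as follows: a Node qualifies only if it carries every label in the selector with the same value. The function names and the sample node names below are hypothetical; this is not scheduler code.

```python
def node_matches(node_labels: dict, node_selector: dict) -> bool:
    """True if the node carries every key/value pair in the selector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def candidate_nodes(nodes: dict, node_selector: dict) -> list:
    """Names of all nodes whose labels satisfy the selector, sorted by name.

    An empty result corresponds to the case where the Pod cannot be scheduled.
    """
    return sorted(name for name, labels in nodes.items() if node_matches(labels, node_selector))
```

For example, with nodes labeled role=frontend and role=backend, a selector of {"role": "frontend"} keeps only the frontend nodes, and the scheduler would then pick one of them.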

Node标签的使用场景:

对集群中不同类型的Node打上不同的标签,可控制应用运行Node的范围。例如role=frontend;role=backend;role=database。

3. NodeAffinity[亲和性调度]

NodeAffinity意为Node亲和性调度策略,NodeSelector为精确匹配,NodeAffinity为条件范围匹配,通过In(属于)、NotIn(不属于)、Exists(存在一个条件)、DoesNotExist(不存在)、Gt(大于)、Lt(小于)等操作符来选择Node,使调度更加灵活。

  • RequiredDuringSchedulingRequiredDuringExecution:类似于NodeSelector,但在Node不满足条件时,系统将从该Node上移除之前调度上的Pod。
  • RequiredDuringSchedulingIgnoredDuringExecution:与上一个类似,区别是在Node不满足条件时,系统不一定从该Node上移除之前调度上的Pod。
  • PreferredDuringSchedulingIgnoredDuringExecution:指定在满足调度条件的Node中,哪些Node应更优先地进行调度。同时在Node不满足条件时,系统不一定从该Node上移除之前调度上的Pod。

如果同时设置了NodeSelector和NodeAffinity,则系统将需要同时满足两者的设置才能进行调度。

4. DaemonSet:特定场景调度

DaemonSet是kubernetes1.2版本新增的一种资源对象,用于管理在集群中每个Node仅运行一份Pod的副本实例。


该用法适用的应用场景:

  • 在每个Node上运行一个GlusterFS存储或者Ceph存储的daemon进程。
  • Run a log collector on every Node, such as fluentd or logstash.
  • Run a monitoring agent on every Node to collect the Node's performance data, such as Prometheus Node Exporter, collectd, New Relic agent, or Ganglia gmond.

DaemonSet的Pod调度策略与RC类似,除了使用系统内置算法在每台Node上进行调度,也可以通过NodeSelector或NodeAffinity来指定满足条件的Node范围进行调度。

5. Job:批处理调度

kubernetes从1.2版本开始支持批处理类型的应用,可以通过kubernetes Job资源对象来定义并启动一个批处理任务。批处理任务通常并行(或串行)启动多个计算进程去处理一批工作项(work item),处理完后,整个批处理任务结束。

5.1. 批处理的三种模式


批处理按任务实现方式不同分为以下几种模式:

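  • Job Template Expansion: one Job object corresponds to one work item, so there are as many independent Jobs as there are work items. This usually suits scenarios with few work items where each work item has a large amount of data to process, for example 10 files (work items) of 100G each.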

  • Queue with Pod Per Work Item 采用一个任务队列存放Work item,一个Job对象作为消费者去完成这些Work item,其中Job会启动N个Pod,每个Pod对应一个Work item。

  • Queue with Variable Pod Count 采用一个任务队列存放Work item,一个Job对象作为消费者去完成这些Work item,其中Job会启动N个Pod,每个Pod对应一个Work item。但Pod的数量是可变的

5.2. Job的三种类型

1)Non-parallel Jobs

通常一个Job只启动一个Pod,除非Pod异常才会重启该Pod,一旦此Pod正常结束,Job将结束。

2)Parallel Jobs with a fixed completion count

并行Job会启动多个Pod,此时需要设定Job的.spec.completions参数为一个正数,当正常结束的Pod数量达到该值则Job结束。

3)Parallel Jobs with a work queue

任务队列方式的并行Job需要一个独立的Queue,Work item都在一个Queue中存放,不能设置Job的.spec.completions参数。

此时Job的特性:

  • 每个Pod能独立判断和决定是否还有任务项需要处理
  • 如果某个Pod正常结束,则Job不会再启动新的Pod
  • 如果一个Pod成功结束,则此时应该不存在其他Pod还在干活的情况,它们应该都处于即将结束、退出的状态
  • 如果所有的Pod都结束了,且至少一个Pod成功结束,则整个Job算是成功结束

参考文章

  • 《Kubernetes权威指南》

3.3.7 - Pod伸缩与升级

1. Pod伸缩

In k8s, an RC keeps the specified number of instances running in the cluster at all times. Pods can be scaled out and in through the RC's scale mechanism.

1.1. 手动伸缩(scale)

kubectl scale rc redis-slave --replicas=3

1.2. 自动伸缩(HPA)

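The Horizontal Pod Autoscaler (HPA) controller automatically scales Pods based on CPU utilization. At the interval defined by the kube-controller-manager startup flag --horizontal-pod-autoscaler-sync-period on the Master (30 seconds by default), the HPA controller checks the CPU utilization of the target Pods and, when the conditions are met, adjusts the number of Pod replicas in the ReplicationController or Deployment so that it matches the user-defined average Pod CPU utilization. Pod CPU utilization data comes from the heapster component, so that component must be installed.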

可以通过kubectl autoscale命令进行快速创建或者使用yaml配置文件进行创建。创建之前需已存在一个RC或Deployment对象,并且该RC或Deployment中的Pod必须定义resources.requests.cpu的资源请求值,以便heapster采集到该Pod的CPU。
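
The scaling decision can be approximated with the ratio rule described in the Kubernetes HPA documentation: desired replicas = ceil(current replicas × current utilization / target utilization), clamped to the configured minimum and maximum. The sketch below is a simplification under that assumption; the function name is hypothetical and the real controller's tolerance band and stabilization behavior are left out.

```python
import math


def desired_replicas(current_replicas: int, current_cpu_percent: float,
                     target_cpu_percent: float, min_replicas: int,
                     max_replicas: int) -> int:
    """Replica count suggested by the ratio rule, clamped to [min, max].

    current_cpu_percent is the average utilization across Pods, expressed
    as a percentage of resources.requests.cpu.
    """
    raw = math.ceil(current_replicas * current_cpu_percent / target_cpu_percent)
    return max(min_replicas, min(max_replicas, raw))
```

With the settings used in this section (--min=1 --max=10 --cpu-percent=50), one replica averaging 200% of its CPU request would be scaled to four replicas under this rule.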

1.2.1. 通过kubectl autoscale创建

例如:

php-apache-rc.yaml

apiVersion: v1
kind: ReplicationController
metadata:
  name: php-apache
spec:
  replicas: 1
  template:
    metadata:
      name: php-apache
      labels:
        app: php-apache
    spec:
      containers:
      - name: php-apache
        image: gcr.io/google_containers/hpa-example
        resources:
          requests:
            cpu: 200m
        ports:
        - containerPort: 80

创建php-apache的RC

kubectl create -f php-apache-rc.yaml

php-apache-svc.yaml

apiVersion: v1
kind: Service
metadata:
  name: php-apache
spec:
  ports:
  - port: 80
  selector:
    app: php-apache

创建php-apache的Service

kubectl create -f php-apache-svc.yaml

创建HPA控制器

kubectl autoscale rc php-apache --min=1 --max=10 --cpu-percent=50

1.2.2. 通过yaml配置文件创建

hpa-php-apache.yaml

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: php-apache
spec:
  scaleTargetRef:
    apiVersion: v1
    kind: ReplicationController
    name: php-apache
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50

创建hpa

kubectl create -f hpa-php-apache.yaml

查看hpa

kubectl get hpa

2. Pod滚动升级

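In k8s, a rolling update is performed with the kubectl rolling-update command. The command creates a new RC (in the same namespace as the old RC), then gradually reduces the number of Pod replicas in the old RC to 0 while increasing the number of Pod replicas in the new RC from 0 to the target value. Throughout the rolling update, the total number of Pod replicas (new and old) stays at the originally expected value.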

2.1. 通过配置文件实现

redis-master-controller-v2.yaml

apiVersion: v1
kind: ReplicationController
metadata:
  name: redis-master-v2
  labels:
    name: redis-master
    version: v2
spec:
  replicas: 1
  selector:
    name: redis-master
    version: v2
  template:
    metadata:
      labels:
        name: redis-master
        version: v2
    spec:
      containers:
      - name: master
        image: kubeguide/redis-master:2.0
        ports:
        - containerPort: 6379

注意事项:

  1. RC的名字(name)不能与旧RC的名字相同
  2. 在selector中应至少有一个Label与旧的RC的Label不同,以标识其为新的RC。例如本例中新增了version的Label。

运行kubectl rolling-update

kubectl rolling-update redis-master -f redis-master-controller-v2.yaml

2.2. 通过kubectl rolling-update命令实现

kubectl rolling-update redis-master --image=kubeguide/redis-master:2.0

与使用配置文件实现不同在于,该执行结果旧的RC被删除,新的RC仍使用旧的RC的名字。

2.3. 升级回滚

kubectl rolling-update加参数--rollback实现回滚操作

kubectl rolling-update redis-master --image=kubeguide/redis-master:2.0 --rollback

参考文章

  • 《Kubernetes权威指南》

3.4 - 配置

3.4.1 - ConfigMap

Pod的配置管理

Kubernetes v1.2 introduced ConfigMap, a unified solution for cluster configuration management.

1. ConfigMap:容器应用的配置管理

使用场景:

  1. 生成为容器内的环境变量。
  2. 设置容器启动命令的启动参数(需设置为环境变量)。
  3. 以Volume的形式挂载为容器内部的文件或目录。

ConfigMap以一个或多个key:value的形式保存在kubernetes系统中供应用使用,既可以表示一个变量的值(例如:apploglevel=info),也可以表示完整配置文件的内容(例如:server.xml=<?xml...>...)。

可以通过yaml配置文件或者使用kubectl create configmap命令的方式创建ConfigMap。

2. 创建ConfigMap

2.1. 通过yaml文件方式

cm-appvars.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-appvars
data:
  apploglevel: info
  appdatadir: /var/data

常用命令

kubectl create -f cm-appvars.yaml

kubectl get configmap

kubectl describe configmap cm-appvars

kubectl get configmap cm-appvars -o yaml

2.2. 通过kubectl命令行方式

通过kubectl create configmap创建,使用参数--from-file或--from-literal指定内容,可以在一行中指定多个参数。

1)通过--from-file参数从文件中进行创建,可以指定key的名称,也可以在一个命令行中创建包含多个key的ConfigMap。

kubectl create configmap NAME --from-file=[key=]source --from-file=[key=]source

2)通过--from-file参数从目录中进行创建,该目录下的每个配置文件名被设置为key,文件内容被设置为value。

kubectl create configmap NAME --from-file=config-files-dir

3)通过--from-literal从文本中进行创建,直接将指定的key=value创建为ConfigMap的内容。

kubectl create configmap NAME --from-literal=key1=value1 --from-literal=key2=value2

容器应用对ConfigMap的使用有两种方法:

  1. 通过环境变量获取ConfigMap中的内容。
  2. 通过Volume挂载的方式将ConfigMap中的内容挂载为容器内部的文件或目录。

2.3. 通过环境变量的方式

ConfigMap的yaml文件:cm-appvars.yaml

apiVersion: v1
kind: ConfigMap
metadata:
  name: cm-appvars
data:
  apploglevel: info
  appdatadir: /var/data

Pod的yaml文件:cm-test-pod.yaml

apiVersion: v1
kind: Pod
metadata:
  name: cm-test-pod
spec:
  containers:
  - name: cm-test
    image: busybox
    command: ["/bin/sh","-c","env|grep APP"]
    env:
    - name: APPLOGLEVEL
      valueFrom:
        configMapKeyRef:
          name: cm-appvars
          key: apploglevel
    - name: APPDATADIR
      valueFrom:
        configMapKeyRef:
          name: cm-appvars
          key: appdatadir

创建命令:

kubectl create -f cm-test-pod.yaml

kubectl get pods --show-all

kubectl logs cm-test-pod

3. 使用ConfigMap的限制条件

  • ConfigMap必须在Pod之前创建
  • ConfigMap也可以定义为属于某个Namespace。只有处于相同Namespace中的Pod可以引用它。
  • kubelet只支持可以被API Server管理的Pod使用ConfigMap。静态Pod无法引用。
  • 在Pod对ConfigMap进行挂载操作时,容器内只能挂载为“目录”,无法挂载为文件。

参考文章

  • 《Kubernetes权威指南》

4 - 核心原理

4.1 - 核心组件

4.1.1 - Kubernetes核心原理(一)之API Server

1. API Server简介

k8s API Server提供了k8s各类资源对象(pod,RC,Service等)的增删改查及watch等HTTP Rest接口,是整个系统的数据总线和数据中心。

kubernetes API Server的功能:

  1. 提供了集群管理的REST API接口(包括认证授权、数据校验以及集群状态变更);
  2. 提供其他模块之间的数据交互和通信的枢纽(其他模块通过API Server查询或修改数据,只有API Server才直接操作etcd);
  3. 是资源配额控制的入口;
  4. 拥有完备的集群安全机制.

kube-apiserver工作原理图

kube-apiserver

2. 如何访问kubernetes API

k8s通过kube-apiserver这个进程提供服务,该进程运行在单个k8s-master节点上。默认有两个端口。

2.1. 本地端口

  1. 该端口用于接收HTTP请求;
  2. 该端口默认值为8080,可以通过API Server的启动参数“--insecure-port”的值来修改默认值;
  3. 默认的IP地址为“localhost”,可以通过启动参数“--insecure-bind-address”的值来修改该IP地址;
  4. 非认证或授权的HTTP请求通过该端口访问API Server。

2.2. 安全端口

  1. 该端口默认值为6443,可通过启动参数“--secure-port”的值来修改默认值;
  2. 默认IP地址为非本地(Non-Localhost)网络端口,通过启动参数“--bind-address”设置该值;
  3. 该端口用于接收HTTPS请求;
  4. It is used for authentication based on token files, client certificates, or HTTP Basic;
  5. 用于基于策略的授权;
  6. 默认不启动HTTPS安全访问控制。

2.3. 访问方式

Kubernetes REST API可参考https://kubernetes.io/docs/api-reference/v1.6/

2.3.1. curl

curl localhost:8080/api
curl localhost:8080/api/v1/pods
curl localhost:8080/api/v1/services
curl localhost:8080/api/v1/replicationcontrollers

2.3.2. Kubectl Proxy

Kubectl Proxy代理程序既能作为API Server的反向代理,也能作为普通客户端访问API Server的代理。通过master节点的8080端口来启动该代理程序。

kubectl proxy --port=8080 &

具体见kubectl proxy --help

[root@node5 ~]# kubectl proxy --help
To proxy all of the kubernetes api and nothing else, use:
kubectl proxy --api-prefix=/
To proxy only part of the kubernetes api and also some static files:
kubectl proxy --www=/my/files --www-prefix=/static/ --api-prefix=/api/
The above lets you 'curl localhost:8001/api/v1/pods'.
To proxy the entire kubernetes api at a different root, use:
kubectl proxy --api-prefix=/custom/
The above lets you 'curl localhost:8001/custom/api/v1/pods'
Usage:
  kubectl proxy [--port=PORT] [--www=static-dir] [--www-prefix=prefix] [--api-prefix=prefix] [flags]
Examples:
# Run a proxy to kubernetes apiserver on port 8011, serving static content from ./local/www/
$ kubectl proxy --port=8011 --www=./local/www/
# Run a proxy to kubernetes apiserver on an arbitrary local port.
# The chosen port for the server will be output to stdout.
$ kubectl proxy --port=0
# Run a proxy to kubernetes apiserver, changing the api prefix to k8s-api
# This makes e.g. the pods api available at localhost:8011/k8s-api/v1/pods/
$ kubectl proxy --api-prefix=/k8s-api
Flags:
      --accept-hosts="^localhost$,^127\.0\.0\.1$,^\[::1\]$": Regular expression for hosts that the proxy should accept.
      --accept-paths="^/.*": Regular expression for paths that the proxy should accept.
      --api-prefix="/": Prefix to serve the proxied API under.
      --disable-filter[=false]: If true, disable request filtering in the proxy. This is dangerous, and can leave you vulnerable to XSRF attacks, when used with an accessible port.
  -p, --port=8001: The port on which to run the proxy. Set to 0 to pick a random port.
      --reject-methods="POST,PUT,PATCH": Regular expression for HTTP methods that the proxy should reject.
      --reject-paths="^/api/.*/exec,^/api/.*/run": Regular expression for paths that the proxy should reject.
  -u, --unix-socket="": Unix socket on which to run the proxy.
  -w, --www="": Also serve static files from the given directory under the specified prefix.
  -P, --www-prefix="/static/": Prefix to serve static files under, if static file directory is specified.
 
Global Flags:
      --alsologtostderr[=false]: log to standard error as well as files
      --api-version="": The API version to use when talking to the server
      --certificate-authority="": Path to a cert. file for the certificate authority.
      --client-certificate="": Path to a client key file for TLS.
      --client-key="": Path to a client key file for TLS.
      --cluster="": The name of the kubeconfig cluster to use
      --context="": The name of the kubeconfig context to use
      --insecure-skip-tls-verify[=false]: If true, the server's certificate will not be checked for validity. This will make your HTTPS connections insecure.
      --kubeconfig="": Path to the kubeconfig file to use for CLI requests.
      --log-backtrace-at=:0: when logging hits line file:N, emit a stack trace
      --log-dir="": If non-empty, write log files in this directory
      --log-flush-frequency=5s: Maximum number of seconds between log flushes
      --logtostderr[=true]: log to standard error instead of files
      --match-server-version[=false]: Require server version to match client version
      --namespace="": If present, the namespace scope for this CLI request.
      --password="": Password for basic authentication to the API server.
  -s, --server="": The address and port of the Kubernetes API server
      --stderrthreshold=2: logs at or above this threshold go to stderr
      --token="": Bearer token for authentication to the API server.
      --user="": The name of the kubeconfig user to use
      --username="": Username for basic authentication to the API server.
      --v=0: log level for V logs
      --vmodule=: comma-separated list of pattern=N settings for file-filtered logging

2.3.3. kubectl客户端

命令行工具kubectl客户端,通过命令行参数转换为对API Server的REST API调用,并将调用结果输出。

命令格式:kubectl [command] [options]

具体可参考k8s常用命令

2.3.4. 编程方式调用

使用场景:

1、运行在Pod里的用户进程调用kubernetes API,通常用来实现分布式集群搭建的目标。

2、开发基于kubernetes的管理平台,比如调用kubernetes API来完成Pod、Service、RC等资源对象的图形化创建和管理界面。可以使用kubernetes提供的Client Library。

具体可参考https://github.com/kubernetes/client-go

3. 通过API Server访问Node、Pod和Service

k8s API Server最主要的REST接口是资源对象的增删改查,另外还有一类特殊的REST接口—k8s Proxy API接口,这类接口的作用是代理REST请求,即kubernetes API Server把收到的REST请求转发到某个Node上的kubelet守护进程的REST端口上,由该kubelet进程负责响应。

3.1. Node相关接口

关于Node相关的接口的REST路径为:/api/v1/proxy/nodes/{name},其中{name}为节点的名称或IP地址。

/api/v1/proxy/nodes/{name}/pods/    #列出指定节点内所有Pod的信息
/api/v1/proxy/nodes/{name}/stats/   #列出指定节点内物理资源的统计信息
/api/v1/proxy/nodes/{name}/spec/    #列出指定节点的概要信息

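The Pod information returned here comes from the Node rather than the etcd database, so the two may differ at a given point in time. If the kubelet is started with the --enable-debugging-handlers=true flag, the kubernetes Proxy API also exposes the following endpoints: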

/api/v1/proxy/nodes/{name}/run      #在节点上运行某个容器
/api/v1/proxy/nodes/{name}/exec     #在节点上的某个容器中运行某条命令
/api/v1/proxy/nodes/{name}/attach   #在节点上attach某个容器
/api/v1/proxy/nodes/{name}/portForward   #实现节点上的Pod端口转发
/api/v1/proxy/nodes/{name}/logs     #列出节点的各类日志信息
/api/v1/proxy/nodes/{name}/metrics  #列出和该节点相关的Metrics信息
/api/v1/proxy/nodes/{name}/runningpods  #列出节点内运行中的Pod信息
/api/v1/proxy/nodes/{name}/debug/pprof  #列出节点内当前web服务的状态,包括CPU和内存的使用情况

3.2. Pod相关接口

/api/v1/proxy/namespaces/{namespace}/pods/{name}/{path:*}      #访问pod的某个服务接口
/api/v1/proxy/namespaces/{namespace}/pods/{name}               #访问Pod
#以下写法不同,功能一样
/api/v1/namespaces/{namespace}/pods/{name}/proxy/{path:*}      #访问pod的某个服务接口
/api/v1/namespaces/{namespace}/pods/{name}/proxy               #访问Pod

3.3. Service相关接口

/api/v1/proxy/namespaces/{namespace}/services/{name}

Pod的proxy接口的作用:在kubernetes集群之外访问某个pod容器的服务(HTTP服务),可以用Proxy API实现,这种场景多用于管理目的,比如逐一排查Service的Pod副本,检查哪些Pod的服务存在异常问题。

4. 集群功能模块之间的通信

kubernetes API Server作为集群的核心,负责集群各功能模块之间的通信,集群内各个功能模块通过API Server将信息存入etcd,当需要获取和操作这些数据时,通过API Server提供的REST接口(GET/LIST/WATCH方法)来实现,从而实现各模块之间的信息交互。

4.1. kubelet与API Server交互

每个Node节点上的kubelet定期就会调用API Server的REST接口报告自身状态,API Server接收这些信息后,将节点状态信息更新到etcd中。kubelet也通过API Server的Watch接口监听Pod信息,从而对Node机器上的POD进行管理。

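| Watched event | kubelet action |
|---|---|
| A new Pod replica is scheduled and bound to this node | Create and start the containers of that Pod |
| A Pod object is deleted | Delete the corresponding Pod containers on this node |
| Pod information is modified | Update the Pod's containers on this node |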

4.2. kube-controller-manager与API Server交互

kube-controller-manager中的Node Controller模块通过API Server提供的Watch接口,实时监控Node的信息,并做相应处理。

4.3. kube-scheduler与API Server交互

Scheduler通过API Server的Watch接口监听到新建Pod副本的信息后,它会检索所有符合该Pod要求的Node列表,开始执行Pod调度逻辑。调度成功后将Pod绑定到目标节点上。

4.4. 特别说明

为了缓解各模块对API Server的访问压力,各功能模块都采用缓存机制来缓存数据,各功能模块定时从API Server获取指定的资源对象信息(LIST/WATCH方法),然后将信息保存到本地缓存,功能模块在某些情况下不直接访问API Server,而是通过访问缓存数据来间接访问API Server。
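
The LIST-then-WATCH caching pattern can be sketched with an in-memory store: one full LIST seeds the cache, and each WATCH event then updates it incrementally, so reads are served locally. The class below is a simplified illustration and not the client-go implementation; the class name and the use of plain dicts keyed by a "name" field are assumptions. ADDED, MODIFIED, and DELETED are the event types used by the Kubernetes watch API.

```python
class ResourceCache:
    """Minimal local cache kept in sync by LIST and WATCH results."""

    def __init__(self):
        self._objects = {}

    def replace(self, objects: list) -> None:
        """Seed (or reseed) the cache from a full LIST response."""
        self._objects = {obj["name"]: obj for obj in objects}

    def apply(self, event_type: str, obj: dict) -> None:
        """Apply a single WATCH event to the cache."""
        if event_type in ("ADDED", "MODIFIED"):
            self._objects[obj["name"]] = obj
        elif event_type == "DELETED":
            self._objects.pop(obj["name"], None)
        else:
            raise ValueError(f"unexpected event type: {event_type}")

    def get(self, name: str):
        """Read from the cache without contacting the API Server."""
        return self._objects.get(name)
```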

参考《kubernetes权威指南》

4.1.2 - Kubernetes核心原理(二)之Controller Manager

1. Controller Manager简介

Controller Manager作为集群内部的管理控制中心,负责集群内的Node、Pod副本、服务端点(Endpoint)、命名空间(Namespace)、服务账号(ServiceAccount)、资源定额(ResourceQuota)的管理,当某个Node意外宕机时,Controller Manager会及时发现并执行自动化修复流程,确保集群始终处于预期的工作状态。

controller manager

每个Controller通过API Server提供的接口实时监控整个集群的每个资源对象的当前状态,当发生各种故障导致系统状态发生变化时,会尝试将系统状态修复到“期望状态”。

2. Replication Controller

为了区分,将资源对象Replication Controller简称RC,而本文中是指Controller Manager中的Replication Controller,称为副本控制器。副本控制器的作用即保证集群中一个RC所关联的Pod副本数始终保持预设值。

  1. 只有当Pod的重启策略是Always的时候(RestartPolicy=Always),副本控制器才会管理该Pod的操作(创建、销毁、重启等)。
  2. RC中的Pod模板就像一个模具,模具制造出来的东西一旦离开模具,它们之间就再没关系了。一旦Pod被创建,无论模板如何变化,也不会影响到已经创建的Pod。
  3. Pod可以通过修改label来脱离RC的管控,该方法可以用于将Pod从集群中迁移,数据修复等调试。
  4. 删除一个RC不会影响它所创建的Pod,如果要删除Pod需要将RC的副本数属性设置为0。
  5. 不要越过RC创建Pod,因为RC可以实现自动化控制Pod,提高容灾能力。

2.1. Replication Controller的职责

  1. 确保集群中有且仅有N个Pod实例,N是RC中定义的Pod副本数量。
  2. 通过调整RC中的spec.replicas属性值来实现系统扩容或缩容。
  3. 通过改变RC中的Pod模板来实现系统的滚动升级。
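
The first two responsibilities reduce to comparing the desired replica count with the Pods that currently exist. The sketch below shows one reconcile step under simplifying assumptions: the function name is hypothetical, and deleting the first surplus Pods in the list is an arbitrary choice made for the example, not the controller's real victim-selection order.

```python
def reconcile(desired_replicas: int, running_pods: list) -> dict:
    """Compute the actions needed to reach the desired replica count.

    Returns how many Pods to create from the template and which
    existing Pods to delete.
    """
    diff = desired_replicas - len(running_pods)
    if diff > 0:
        return {"create": diff, "delete": []}
    if diff < 0:
        # Too many Pods: pick the surplus ones (arbitrarily, the first -diff).
        return {"create": 0, "delete": running_pods[:-diff]}
    return {"create": 0, "delete": []}
```

Changing spec.replicas simply changes desired_replicas, so scaling out and in falls out of the same comparison.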

2.2. Replication Controller使用场景

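| Scenario | Description | Command |
|---|---|---|
| Rescheduling | When a node fails or a Pod terminates unexpectedly, Pods are rescheduled so that the cluster keeps running the specified number of replicas. | |
| Elastic scaling | Changing the replication controller's spec.replicas, either manually or through an autoscaling agent, scales the application out or in. | kubectl scale |
| Rolling update | A new RC file is created and applied with kubectl or the API; new replicas are added while old replicas are removed, and the old RC is deleted once it has 0 replicas. | kubectl rolling-update |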

滚动升级,具体可参考kubectl rolling-update --help,官方文档:https://kubernetes.io/docs/tasks/run-application/rolling-update-replication-controller/

3. Node Controller

kubelet在启动时会通过API Server注册自身的节点信息,并定时向API Server汇报状态信息,API Server接收到信息后将信息更新到etcd中。

Node Controller通过API Server实时获取Node的相关信息,实现管理和监控集群中的各个Node节点的相关控制功能。流程如下

Node Controller

1、Controller Manager在启动时如果设置了--cluster-cidr参数,那么为每个没有设置Spec.PodCIDR的Node节点生成一个CIDR地址,并用该CIDR地址设置节点的Spec.PodCIDR属性,防止不同的节点的CIDR地址发生冲突。

2、具体流程见以上流程图。

3、逐个读取节点信息,如果节点状态变成非“就绪”状态,则将节点加入待删除队列,否则将节点从该队列删除。

4. ResourceQuota Controller

资源配额管理确保指定的资源对象在任何时候都不会超量占用系统物理资源。

支持三个层次的资源配置管理:

1)容器级别:对CPU和Memory进行限制

2)Pod级别:对一个Pod内所有容器的可用资源进行限制

3)Namespace级别:包括

  • Pod数量
  • Replication Controller数量
  • Service数量
  • ResourceQuota数量
  • Secret数量
  • 可持有的PV(Persistent Volume)数量

说明:

  1. k8s配额管理是通过Admission Control(准入控制)来控制的;
  2. Admission Control提供两种配额约束方式:LimitRanger和ResourceQuota;
  3. LimitRanger作用于Pod和Container;
  4. ResourceQuota作用于Namespace上,限定一个Namespace里的各类资源的使用总额。

ResourceQuota Controller流程图

ResourceQuota Controller

5. Namespace Controller

用户通过API Server可以创建新的Namespace并保存在etcd中,Namespace Controller定时通过API Server读取这些Namespace信息。

如果Namespace被API标记为优雅删除(即设置删除期限,DeletionTimestamp),则将该Namespace状态设置为“Terminating”,并保存到etcd中。同时Namespace Controller删除该Namespace下的ServiceAccount、RC、Pod等资源对象。

6. Endpoint Controller

Service、Endpoint、Pod的关系:

Endpoint Controller

Endpoints表示了一个Service对应的所有Pod副本的访问地址,而Endpoints Controller负责生成和维护所有Endpoints对象的控制器。它负责监听Service和对应的Pod副本的变化。

  1. 如果监测到Service被删除,则删除和该Service同名的Endpoints对象;
  2. 如果监测到新的Service被创建或修改,则根据该Service信息获得相关的Pod列表,然后创建或更新Service对应的Endpoints对象。
  3. 如果监测到Pod的事件,则更新它对应的Service的Endpoints对象。

kube-proxy进程获取每个Service的Endpoints,实现Service的负载均衡功能。
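
The core of the second rule, deriving a Service's Endpoints from its selector and the current Pod list, can be sketched as below. This is an illustration only: the function name and the dict layout of a Pod (labels, ip, ready) are assumptions, and only ready Pods are included, in line with the readiness-probe behavior described earlier in these notes.

```python
def build_endpoints(service_selector: dict, pods: list, port: int) -> list:
    """Addresses of ready Pods whose labels match the Service selector."""
    addresses = []
    for pod in pods:
        labels = pod["labels"]
        selected = all(labels.get(k) == v for k, v in service_selector.items())
        if selected and pod["ready"]:
            addresses.append(f'{pod["ip"]}:{port}')
    return sorted(addresses)
```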

7. Service Controller

Service Controller是属于kubernetes集群与外部的云平台之间的一个接口控制器。Service Controller监听Service变化,如果是一个LoadBalancer类型的Service,则确保外部的云平台上对该Service对应的LoadBalancer实例被相应地创建、删除及更新路由转发表。

参考《Kubernetes权威指南》

4.1.3 - Kubernetes核心原理(三)之Scheduler

1. Scheduler简介

Scheduler负责Pod调度。在整个系统中起"承上启下"作用,承上:负责接收Controller Manager创建的新的Pod,为其选择一个合适的Node;启下:Node上的kubelet接管Pod的生命周期。

Scheduler:

1)通过调度算法为待调度Pod列表的每个Pod从Node列表中选择一个最适合的Node,并将信息写入etcd中

2)kubelet通过API Server监听到kubernetes Scheduler产生的Pod绑定信息,然后获取对应的Pod清单,下载Image,并启动容器。

scheduler

2. 调度流程

1、预选调度过程,即遍历所有目标Node,筛选出符合要求的候选节点,kubernetes内置了多种预选策略(xxx Predicates)供用户选择

2、确定最优节点,在第一步的基础上采用优选策略(xxx Priority)计算出每个候选节点的积分,取最高积分。

调度流程通过插件式加载的“调度算法提供者”(AlgorithmProvider)具体实现,一个调度算法提供者就是包括一组预选策略与一组优选策略的结构体。
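
The two-phase flow (filter with predicates, then rank with priorities) can be sketched generically. The function below is a simplified model, not the kube-scheduler's code: the names are hypothetical, predicates are callables returning True or False, priorities are callables returning a numeric score, and ties are broken by list order.

```python
def schedule(pod: dict, nodes: list, predicates: list, priorities: list):
    """Pick a node name for the pod, or None if no node is feasible."""
    # Phase 1: keep only nodes for which every predicate returns True.
    feasible = [node for node in nodes if all(pred(pod, node) for pred in predicates)]
    if not feasible:
        return None
    # Phase 2: sum the priority scores and take the highest-scoring node.
    best = max(feasible, key=lambda node: sum(prio(pod, node) for prio in priorities))
    return best["name"]
```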

3. 预选策略

说明:返回true表示该节点满足该Pod的调度条件;返回false表示该节点不满足该Pod的调度条件。

3.1. NoDiskConflict

判断备选Pod的数据卷是否与该Node上已存在Pod挂载的数据卷冲突,如果是则返回false,否则返回true。

3.2. PodFitsResources

判断备选节点的资源是否满足备选Pod的需求,即节点的剩余资源满不满足该Pod的资源使用。

  1. 计算备选Pod和节点中已用资源(该节点所有Pod的使用资源)的总和。
  2. 获取备选节点的状态信息,包括节点资源信息。
  3. 如果(备选Pod+节点已用资源>该节点总资源)则返回false,即剩余资源不满足该Pod使用;否则返回true。
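
The three steps above amount to a per-resource comparison. A minimal sketch, assuming resources are given as plain numbers in consistent units (for example CPU in millicores and memory in bytes); the function name and dict layout are hypothetical:

```python
def pod_fits_resources(pod_request: dict, node_used: dict, node_capacity: dict) -> bool:
    """True if, for every requested resource, request + used <= capacity."""
    return all(
        amount + node_used.get(resource, 0) <= node_capacity.get(resource, 0)
        for resource, amount in pod_request.items()
    )
```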

3.3. PodSelectorMatches

判断节点是否包含备选Pod的标签选择器指定的标签,即通过标签来选择Node。

  1. 如果Pod中没有指定spec.nodeSelector,则返回true。
  2. 否则获得备选节点的标签信息,判断该节点的标签信息中是否包含该Pod的spec.nodeSelector中指定的标签,如果包含返回true,否则返回false。

3.4. PodFitsHost

判断备选Pod的spec.nodeName所指定的节点名称与备选节点名称是否一致,如果一致返回true,否则返回false。

3.5. CheckNodeLabelPresence

检查备选节点中是否有Scheduler配置的标签,如果有返回true,否则返回false。

3.6. CheckServiceAffinity

判断备选节点是否包含Scheduler配置的标签,如果有返回true,否则返回false。

3.7. PodFitsPorts

判断备选Pod所用的端口列表中的端口是否在备选节点中已被占用,如果被占用返回false,否则返回true。

4. 优选策略

4.1. LeastRequestedPriority

优先从备选节点列表中选择资源消耗最小的节点(CPU+内存)。

4.2. CalculateNodeLabelPriority

优先选择含有指定Label的节点。

4.3. BalancedResourceAllocation

优先从备选节点列表中选择各项资源使用率最均衡的节点。
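
The two resource-based priorities can be illustrated with scoring functions on a 0 to 10 scale. The formulas below are modeled on those used by early Kubernetes schedulers and should be read as an approximation rather than the exact implementation; the function names and parameters are assumptions made for the example.

```python
def least_requested_score(cpu_requested: float, cpu_capacity: float,
                          mem_requested: float, mem_capacity: float) -> float:
    """Higher score for nodes with more unrequested CPU and memory."""
    cpu_score = (cpu_capacity - cpu_requested) * 10 / cpu_capacity
    mem_score = (mem_capacity - mem_requested) * 10 / mem_capacity
    return (cpu_score + mem_score) / 2


def balanced_allocation_score(cpu_fraction: float, mem_fraction: float) -> float:
    """Higher score when CPU and memory utilization fractions are close."""
    return 10 - abs(cpu_fraction - mem_fraction) * 10
```

A node that is half full on CPU and a quarter full on memory scores well on the first function but loses some points on the second, which is why the two priorities are typically combined.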

参考《Kubernetes权威指南》

4.1.4 - Kubernetes核心原理(四)之kubelet

1. kubelet简介

在kubernetes集群中,每个Node节点都会启动kubelet进程,用来处理Master节点下发到本节点的任务,管理Pod和其中的容器。kubelet会在API Server上注册节点信息,定期向Master汇报节点资源使用情况,并通过cAdvisor监控容器和节点资源。可以把kubelet理解成【Server-Agent】架构中的agent,是Node上的pod管家。

更多kubelet配置参数信息可参考kubelet --help

2. 节点管理

节点通过设置kubelet的启动参数“--register-node”,来决定是否向API Server注册自己,默认为true。可以通过kubelet --help或者查看kubernetes源码【cmd/kubelet/app/server.go中】来查看该参数。

kubelet的配置文件

默认配置文件在/etc/kubernetes/kubelet中,其中

  • --api-servers:用来配置Master节点的IP和端口。
  • --kubeconfig:用来配置kubeconfig的路径,kubeconfig文件常用来指定证书。
  • --hostname-override:用来配置该节点在集群中显示的主机名。
  • --node-status-update-frequency:配置kubelet向Master心跳上报的频率,默认为10s。

3. Pod管理

kubelet有几种方式获取自身Node上所需要运行的Pod清单。但本文只讨论通过API Server监听etcd目录,同步Pod列表的方式。

kubelet通过API Server Client使用WatchAndList的方式监听etcd中/registry/nodes/${当前节点名称}和/registry/pods的目录,将获取的信息同步到本地缓存中。

kubelet监听etcd,执行对Pod的操作,对容器的操作则是通过Docker Client执行,例如启动删除容器等。

kubelet创建和修改Pod流程:

  1. 为该Pod创建一个数据目录。
  2. 从API Server读取该Pod清单。
  3. 为该Pod挂载外部卷(External Volume)
  4. 下载Pod用到的Secret。
  5. 检查运行的Pod,执行Pod中未完成的任务。
  6. 先创建一个Pause容器,该容器接管Pod的网络,再创建其他容器。
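  7. Processing for each container in the Pod:
    1. Compare the container's hash value and act accordingly.
    2. If the container has terminated and no restart policy is specified, do nothing.
    3. Call the Docker Client to pull the container image, then call the Docker Client to run the container.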

4. 容器健康检查

Pod通过探针的方式来检查容器的健康状态,具体可参考Pod详解#Pod健康检查

5. cAdvisor资源监控

kubelet通过cAdvisor获取本节点信息及容器的数据。cAdvisor为谷歌开源的容器资源分析工具,默认集成到kubernetes中。

cAdvisor自动采集CPU,内存,文件系统,网络使用情况,容器中运行的进程,默认端口为4194。可以通过Node IP+Port访问。

更多参考:http://github.com/google/cadvisor

参考《Kubernetes权威指南》

4.2 - 流程图

4.2.1 - Pod创建流程

Pod创建基本流程图

Pod创建完整流程图

图片来源:https://fuckcloudnative.io/posts/what-happens-when-k8s/


4.2.2 - PVC创建流程

pvc流程

流程如下:

  1. 用户创建了一个包含 PVC 的 Pod,该 PVC 要求使用动态存储卷;
  2. Scheduler 根据 Pod 配置、节点状态、PV 配置等信息,把 Pod 调度到一个合适的 Worker 节点上;
  3. PV 控制器 watch 到该 Pod 使用的 PVC 处于 Pending 状态,于是调用 Volume Plugin(in-tree)创建存储卷,并创建 PV 对象(out-of-tree 由 External Provisioner 来处理);
  4. AD 控制器发现 Pod 和 PVC 处于待挂接状态,于是调用 Volume Plugin 挂接存储设备到目标 Worker 节点上
  5. 在 Worker 节点上,Kubelet 中的 Volume Manager 等待存储设备挂接完成,并通过 Volume Plugin 将设备挂载到全局目录:/var/lib/kubelet/pods/[pod uid]/volumes/kubernetes.io~iscsi/[PVname](以 iscsi 为例);
  6. Kubelet 通过 Docker 启动 Pod 的 Containers,用 bind mount 方式将已挂载到本地全局目录的卷映射到容器中。

详细流程图

5 - 容器网络

5.1 - Docker网络

1. Docker的网络基础

1.1. Network Namespace

不同的网络命名空间中,协议栈是独立的,完全隔离,彼此之间无法通信。同一个网络命名空间有独立的路由表和独立的Iptables/Netfilter来提供包的转发、NAT、IP包过滤等功能。

1.1.1. 网络命名空间的实现

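The global variables related to the network protocol stack are turned into members of a Net Namespace variable, and a Namespace parameter is added to calls into the protocol stack functions.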

1.1.2. 网络命名空间的操作

1、创建网络命名空间

ip netns add name

2、在命名空间内执行命令

ip netns exec name command

3、进入命名空间

ip netns exec name bash

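2. How Docker implements networking

2.1. Container network

Docker uses Linux bridging: it creates a virtual bridge for containers (docker0) on the host. When Docker starts a container, it assigns the container an IP address from the bridge's subnet, called the Container-IP, and the bridge becomes each container's default gateway. Because all containers on the same host are attached to the same bridge, they can talk to each other directly using their Container-IPs.

The Docker bridge is virtual and created by the host; it is not a real network device, so the external network cannot address it. This means the external network cannot reach a container directly through its Container-IP. To make a container reachable from outside, map a container port to the host (port mapping) by passing -p or -P to docker run when creating the container. The container is then accessed at [host IP]:[host port].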


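2.2. The four network modes

Docker network mode | Option | Description
host mode | --net=host | The container shares the host's network namespace.
container mode | --net=container:NAME_or_ID | The container shares a network namespace with another container. A Kubernetes Pod is a group of containers sharing one network namespace.
none mode | --net=none | The container has its own network namespace, but no network setup is done for it (no veth pair, no bridge attachment, no IP configuration, and so on).
bridge mode | --net=bridge (the default) | Bridged networking.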

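3. Docker network modes

3.1. bridge mode

In bridge mode, a container gets its own independent network stack. This is implemented by the parent process passing the CLONE_NEWNET flag when it creates the child process, which creates a new network namespace.

Implementation steps:

  1. When the Docker Daemon starts for the first time, it creates a virtual bridge, docker0, whose address usually starts with 172.x.x.x, and assigns this network a subnet from the private address space.
  2. For every container it creates, Docker creates a virtual Ethernet device pair (veth pair). One end is attached to the bridge; the other end is mapped, using namespaces, to the eth0 device inside the container. eth0 is then given an IP address from the bridge's address range.

In general, the host IP, the docker0 IP, and the container IPs are in different IP ranges. By default, docker0 and the container IPs are not visible from outside; to the outside world they are internal addresses.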
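The address assignment in step 2 can be illustrated with a small sketch. It assumes the bridge subnet is 172.17.0.0/16 and that addresses are handed out sequentially; `BridgeAllocator` is a hypothetical helper and does not reflect Docker's real IPAM implementation.

```python
import ipaddress


class BridgeAllocator:
    """Hands out container IPs from the bridge subnet (hypothetical helper)."""

    def __init__(self, cidr: str = "172.17.0.0/16"):
        self.network = ipaddress.ip_network(cidr)
        hosts = iter(self.network.hosts())
        # The first usable address is kept for the bridge itself (docker0),
        # which then acts as the default gateway for every container.
        self.gateway = next(hosts)
        self._free = hosts
        self.leases = {}

    def allocate(self, container_id: str):
        """Assign the next free address to the container's eth0."""
        ip = next(self._free)
        self.leases[container_id] = ip
        return ip


bridge = BridgeAllocator()
first = bridge.allocate("container-1")
second = bridge.allocate("container-2")
print(bridge.gateway, first, second)
```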

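3.1.1. Accessing a Docker container from an external network

External clients reach a Docker container through port mapping (NAT): Docker uses NAT to bind a service inside the container to a host port, port_1.

The flow for external access is:

  1. The external network sends a request to the host IP and the mapped port, port_1.
  2. When the host receives the request, DNAT rewrites the destination IP (the host IP) and destination port (the mapped port, port_1) to the container IP and the container port, port_0.
  3. Because the host can route to the container IP, it sends the request to the veth pair.
  4. The veth pair delivers the request to eth0 inside the container, where the service handles it.

3.1.2. Accessing an external network from a Docker container

The flow for outbound access is:

  1. The container sends a request to an external destination IP and destination port, port_2. The source IP in the packet is the container IP.

  2. The request leaves through eth0 inside the container and reaches the docker0 bridge at the other end of the veth pair.

  3. The docker0 bridge forwards the request to the host's eth0 using IP forwarding.

  4. While processing the request, the host uses SNAT to replace the source IP with the IP of the host's eth0.

  5. The rewritten packet is sent to the external network according to its destination IP.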
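The two address rewrites described above (DNAT for inbound traffic, SNAT for outbound traffic) can be modeled in a few lines. This is a toy simulation, not iptables: the host address 192.0.2.10, the 8080 -> 80 port mapping, and the `Packet` type are assumptions chosen for the example, and a real MASQUERADE rule may also rewrite the source port.

```python
import ipaddress
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


HOST_IP = "192.0.2.10"                               # hypothetical host eth0 address
CONTAINER_NET = ipaddress.ip_network("172.17.0.0/16")
PORT_MAP = {8080: ("172.17.0.2", 80)}                # host port_1 -> (container IP, port_0)


def dnat(pkt: Packet) -> Packet:
    """Inbound: rewrite host IP/port_1 to container IP/port_0."""
    if pkt.dst_ip == HOST_IP and pkt.dst_port in PORT_MAP:
        ip, port = PORT_MAP[pkt.dst_port]
        return replace(pkt, dst_ip=ip, dst_port=port)
    return pkt


def snat(pkt: Packet) -> Packet:
    """Outbound: replace a container source IP with the host's eth0 IP."""
    if ipaddress.ip_address(pkt.src_ip) in CONTAINER_NET:
        return replace(pkt, src_ip=HOST_IP)
    return pkt


inbound = Packet("203.0.113.5", 40000, HOST_IP, 8080)
outbound = Packet("172.17.0.2", 51000, "198.51.100.7", 443)
print(dnat(inbound))
print(snat(outbound))
```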

3.1.3. Drawbacks

NAT adds processing to every packet, which may hurt performance and reduce network transfer efficiency.

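3.2. host mode

host mode does not create an isolated network environment for the container. The container shares the host's network namespace and uses the host's eth0 to talk to the outside world. It also shares the host's port space, so a port the container wants may conflict with one already in use on the host.

This is implemented by the parent process not passing the CLONE_NEWNET flag when it creates the child process, so the child shares the host's network namespace.

Because host mode does not forward traffic through NAT, it performs comparatively well, but it provides no network isolation and can lead to port conflicts.

3.3. container mode

In container mode, a Docker container uses the network namespace of another container, so both are in the same network namespace.

Steps:

  1. Look up the network namespace of the other container.
  2. Have the newly created container use that network namespace.

Sharing a network namespace puts the containers in the same network environment, so they can talk to each other directly over localhost. This simplifies communication between tightly coupled containers.

The Pod concept in k8s works this way: a group of containers shares one network namespace, so the containers in a Pod can talk to each other directly over localhost.

3.4. none mode

none mode creates no network environment for the container. Users can set up whatever custom network configuration they need by hand.

Reference:

  • 《Docker源码分析》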

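5.2 - K8S networking

1. The Kubernetes network model

1.1. Basic principles

  1. Every Pod has its own IP address, and all Pods are assumed to live in a flat network space where they can reach each other directly. A Pod can be reached by its Pod IP regardless of whether the two sides run on the same Node.
  2. The Pod IP is the smallest unit of IP allocation in k8s. All containers in a Pod share one network stack; this is called the IP-per-Pod model.
  3. The Pod's IP is actually allocated by docker0, and the IP address and ports that the Pod sees internally are the same as those seen from outside. Containers in the same Pod share the network and can reach each other's ports through localhost, much like different processes in the same VM.
  4. From the perspective of port allocation, name resolution, service discovery, load balancing, and application configuration, the IP-per-Pod model lets a Pod be treated like a standalone VM or physical machine.

1.2. What k8s requires of the cluster network

  1. All containers can communicate with all other containers without NAT.
  2. All nodes can communicate with all containers without NAT, and vice versa.
  3. The address a container sees for itself is the same address that others see for it.

These requirements can be met with third-party open-source solutions such as flannel.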

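1.3. Network architecture diagram

[Figure: network architecture diagram]

1.4. Summary of IP concepts in a k8s cluster

From outside the cluster to inside:

IP type | Description
Proxy-IP | Public IP of the proxy layer; the gateway server through which external clients reach applications. [The IP you actually need to care about]
Service-IP | The fixed virtual IP of a Service. A Service-IP is internal and cannot be addressed from outside.
Node-IP | The IP of the host that runs the containers.
Container-Bridge-IP | The IP of the container bridge (docker0). All container traffic is forwarded through the container bridge.
Pod-IP | The IP of a Pod, equivalent to the Container-IP of the Pod's network container.
Container-IP | The IP of a container. A container's network is an isolated network space.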

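2. How Kubernetes implements networking

k8s networking scenarios:

  1. Direct communication between containers.
  2. Communication between Pods.
  3. Communication between Pods and Services.
  4. Communication between the outside of the cluster and components inside it.

2.1. Pod network

A Pod is the smallest scheduling unit in Kubernetes. It is a logical concept: a set of containers that all run on the same host and share the same network namespace, so they can talk to each other and reach each other's ports locally. In practice every Pod contains a network container that does nothing except hold the Pod's network; the application containers join the network container's network to share it. The Pod network is still a container network underneath, so the Pod-IP is the Container-IP of the network container.

A container cloud platform usually builds its network model as a flat network plane. Within this plane, a Pod is a network unit at the same level as the network of a Kubernetes Node.

2.2. Communication between containers in a Pod

Containers in the same Pod share one network namespace, so they can communicate directly over localhost.

2.3. Communication between Pods

2.3.1. Pods on the same Node

On the same Node, each Pod has a globally unique IP, and Pods can talk to each other directly by Pod IP. Pod addresses are in the same subnet as docker0.

Before the pause container starts, a virtual Ethernet interface pair (veth pair) is created. One end is eth0 inside the container and the other end is vethxxx outside it. vethxxx is attached to bridge0, the bridge that the container runtime is configured to use, and eth0 is given an IP from that network's range.

When Pod-A sends a packet to Pod-B on the same node, the packet travels as follows:

pod-a eth0 -> pod-a vethxxx -> bridge0 -> pod-b vethxxx -> pod-b eth0

Pods on the same node are all attached to the same bridge0, so bridge0 can carry traffic between them. The bridge0 instances on different nodes are not connected, so communication between Pods on different nodes requires connecting the bridge0 of each node.

2.3.2. Pods on different Nodes

Between Nodes, a Node IP acts like an external IP and can be reached directly, while docker0 and the Pod IPs inside a Node are internal IPs that cannot be reached directly from another Node. Traffic has to be forwarded through the Node's network interface.

Communication between Pods on different Nodes therefore needs two conditions:

  1. Pod-IP allocation is planned across the whole cluster so that there are no conflicts (a third-party open-source tool such as flannel can manage this).
  2. Each Node-IP is associated with the Pod-IPs on that Node, so that traffic is sent to the Node-IP and then forwarded to the Pod-IP.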
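The two conditions can be checked with a short sketch. The node addresses and the 10.244.x.0/24 Pod subnets below are hypothetical values chosen for illustration; `find_conflicts` covers condition 1 and `node_for_pod` covers condition 2.

```python
import ipaddress

# Hypothetical plan: Node-IP -> Pod subnet assigned to that node.
NODE_POD_CIDRS = {
    "192.168.1.11": "10.244.1.0/24",
    "192.168.1.12": "10.244.2.0/24",
}


def find_conflicts(cidrs: dict) -> list:
    """Condition 1: return every pair of nodes whose Pod subnets overlap."""
    nets = [(node, ipaddress.ip_network(cidr)) for node, cidr in cidrs.items()]
    return [
        (a, b)
        for i, (a, net_a) in enumerate(nets)
        for b, net_b in nets[i + 1:]
        if net_a.overlaps(net_b)
    ]


def node_for_pod(pod_ip: str, cidrs: dict):
    """Condition 2: find the Node-IP that traffic for this Pod-IP is sent to."""
    addr = ipaddress.ip_address(pod_ip)
    for node_ip, cidr in cidrs.items():
        if addr in ipaddress.ip_network(cidr):
            return node_ip
    return None


print(find_conflicts(NODE_POD_CIDRS))
print(node_for_pod("10.244.2.7", NODE_POD_CIDRS))
```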

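Communication between Pods on different nodes requires connecting the bridge0 of each node. There are several ways to do this, mainly overlay and underlay networks, or conventional layer-3 routing.

The bridge0 on each node needs its own IP range so that Pod IP allocation never conflicts, and each node's physical interface eth0 must be connected to that node's bridge0. A packet sent from pod-a on node a to pod-b on node b then takes this path:

pod-a eth0 on node a -> pod-a vethxxx -> bridge0 on node a -> eth0 on node a ->

eth0 on node b -> bridge0 on node b -> pod-b vethxxx -> pod-b eth0

1. Communication between Pods

Example: Pod1 and Pod2 (same host), and Pod1 and Pod3 (different hosts), can communicate.

Implementation: a Pod's Pod-IP is allocated by the Docker bridge and is unique within its Node. Configuring the Docker bridge on each Kubernetes Node with a different IP subnet therefore keeps Pod-IPs from colliding across Nodes.

2. Communication between Nodes and Pods

Example: Node1 can communicate with Pod1/Pod2 (same host) and with Pod3 (different host).

Implementation: create an overlay network in the container cluster to connect all nodes. Today this can be done with third-party network plugins such as Flannel and Open vSwitch.

Access between Pods on different nodes can also be handled by the Pod IP routing tables that calico builds.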

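2.4. Service network

A Service acts as a service proxy in front of Pods: it presents a single access point and forwards requests to the Pods behind it. Service forwarding is a key part of how Kubernetes implements service orchestration. Every Service gets a virtual IP, called the Service-IP. The Kubernetes Proxy component (kube-proxy) implements Service-IP routing and forwarding, which forms a virtual forwarding network on top of the container overlay network.

Kubernetes Proxy provides the following:

  1. Forwards requests sent to a Service's Service-IP to its Endpoints (that is, Pod-IPs).
  2. Watches Services and Endpoints for changes and refreshes the forwarding rules in real time.
  3. Load balancing.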
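The three responsibilities can be pictured with a toy in-memory model. This is not how kube-proxy is implemented (it programs iptables or IPVS rules, and its iptables mode picks backends randomly rather than round-robin); `ServiceProxy`, the Service-IP, and the Pod-IPs are assumptions made for the example.

```python
import itertools


class ServiceProxy:
    """Toy model: Service-IP -> Endpoints, with round-robin selection."""

    def __init__(self):
        self._endpoints = {}
        self._cursors = {}

    def update_endpoints(self, service_ip: str, pod_ips: list) -> None:
        """Called when a Service or its Endpoints change: refresh the rules."""
        self._endpoints[service_ip] = list(pod_ips)
        self._cursors[service_ip] = itertools.cycle(self._endpoints[service_ip])

    def pick_backend(self, service_ip: str) -> str:
        """Forward a request for the Service-IP to one of its Pod-IPs."""
        if not self._endpoints.get(service_ip):
            raise LookupError(f"no endpoints for {service_ip}")
        return next(self._cursors[service_ip])


proxy = ServiceProxy()
proxy.update_endpoints("10.96.0.10", ["10.244.1.5", "10.244.2.7"])
picks = [proxy.pick_backend("10.96.0.10") for _ in range(3)]
print(picks)

# A Pod is replaced: the rule set is refreshed and new requests follow it.
proxy.update_endpoints("10.96.0.10", ["10.244.3.9"])
after_update = proxy.pick_backend("10.96.0.10")
print(after_update)
```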

3. Open-source network components

3.1. Flannel

See the Flannel introduction for details.

Reference: 《Kubernetes权威指南》

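5.3 - CNI

5.3.1 - Introduction to the CNI interface

1. CNI (Container Network Interface)

CNI (Container Network Interface) defines a uniform interface for container networking, so that kubelet can call different network plugins through one standard API to implement different networking features.

The kubelet startup flag --network-plugin=cni selects the CNI plugin. kubelet reads files from --cni-conf-dir (default /etc/cni/net.d) and uses the CNI configuration in that file to set up networking for each Pod. The CNI configuration file must conform to the CNI specification, and every CNI plugin the configuration refers to must be present under --cni-bin-dir (default /opt/cni/bin). If there are several CNI configuration files, kubelet uses the first one in lexicographic order of file name.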
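The "first file in lexicographic order" rule can be sketched as below. The list of accepted extensions (.conf, .conflist, .json) is an assumption for this example, and `pick_cni_config` is a hypothetical helper rather than kubelet code; the demo uses a temporary directory instead of /etc/cni/net.d.

```python
import os
import tempfile

CNI_EXTENSIONS = (".conf", ".conflist", ".json")  # assumed set of accepted suffixes


def pick_cni_config(conf_dir: str):
    """Return the path of the configuration file that sorts first by name."""
    names = sorted(n for n in os.listdir(conf_dir) if n.endswith(CNI_EXTENSIONS))
    return os.path.join(conf_dir, names[0]) if names else None


with tempfile.TemporaryDirectory() as demo_dir:
    for name in ("20-b.conf", "10-a.conflist", "README"):
        open(os.path.join(demo_dir, name), "w").close()
    chosen = os.path.basename(pick_cni_config(demo_dir))

print(chosen)
```

This is why configuration files are commonly given numeric prefixes such as 10- or 20-: the prefix controls which file wins.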

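The CNI specification defines:

  • The format of network configuration files

  • The protocol between the container runtime and CNI plugins

  • The procedure for executing network plugins based on the supplied configuration

  • The procedure for a network plugin to call other plugins

  • The data format of the result a plugin returns to the runtime

2. CNI configuration file format

CNI configuration files are JSON. The default configuration path is /etc/cni/net.d, and the default path for plugin binaries is /opt/cni/bin.

2.1. Top-level fields

  • cniVersion (string): the version of the CNI specification in use, for example 0.4.0.

  • name (string): the name of the target network.

  • disableCheck (boolean): disables the CHECK operation.

  • plugins (list): the list of CNI plugins and their configuration.

2.2. Plugin configuration fields

The fields a plugin needs depend on the plugin.

Required field:

  • type (string): the name of the plugin binary on the node, such as bridge, sriov, or macvlan.

Optional fields:

  • capabilities (dictionary)

  • ipMasq (boolean): sets up outbound masquerading for the target network, that is, rewrites the source IP address of packets that leave the container through the gateway.

    This parameter must be set when the container uses the host as its gateway. Otherwise packets sent from inside the container cannot be routed through the gateway to other subnets: the destination subnet does not recognize the container's internal IP address, so the packets end up being dropped.

  • ipam (dictionary): IPAM (IP Address Management) provides a set of methods for managing IPs and routes. It corresponds to a set of standard IPAM plugins provided by CNI, such as host-local, dhcp, and static. The bridge plugin, for example, calls the IPAM plugin you specify to allocate and manage the IP addresses of network devices. If you develop your own ipam plugin, you can define and implement its input parameters yourself.

    host-local is used as the example below.

    • type: the name of the IPAM plugin to use, here host-local.
    • subnet: the subnet assigned to the target network, including network ID and netmask, in CIDR notation. For example, 10.15.10.0/24 means network 10.15.10.0 with netmask 255.255.255.0.
    • routes: routing rules; the plugin generates the corresponding entries in the container's routing table. dst is the destination subnet in CIDR notation, and gw is the gateway IP address, that is, the next hop on the way to the destination subnet. If gw is omitted, the plugin chooses the default gateway. A dst of 0.0.0.0/0 with the default gateway means "any network": packets are sent through the default gateway to any network. This is a default route, chosen when no other routing rule matches.
    • rangeStart: the first IP address that may be allocated
    • rangeEnd: the last IP address that may be allocated
    • gateway: the IP address of the gateway (that is, the bridge to be created on the host). If omitted, the plugin picks the first address in the allocatable range as the gateway's IP address.
  • dns (dictionary, optional): DNS configuration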

    • nameservers (list of strings, optional)

    • domain (string, optional)

    • search (list of strings, optional)

    • options (list of strings, optional)
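To make the subnet, rangeStart, rangeEnd, and gateway fields more concrete, here is a rough in-memory sketch of the kind of allocation an IPAM plugin performs. It is not the host-local plugin itself (which keeps its state on disk); `HostLocalSketch` is a hypothetical class, and defaulting the gateway to the first usable address of the subnet is an assumption of this example.

```python
import ipaddress


class HostLocalSketch:
    """In-memory approximation of range-based IP allocation."""

    def __init__(self, subnet, range_start=None, range_end=None, gateway=None):
        self.net = ipaddress.ip_network(subnet)
        first = self.net.network_address + 1
        last = self.net.broadcast_address - 1
        self.gateway = ipaddress.ip_address(gateway) if gateway else first
        self.start = ipaddress.ip_address(range_start) if range_start else first
        self.end = ipaddress.ip_address(range_end) if range_end else last
        self.allocated = {}  # container ID -> address

    def allocate(self, container_id: str) -> str:
        """Return the lowest free address in [rangeStart, rangeEnd] in CIDR form."""
        used = set(self.allocated.values())
        candidate = self.start
        while candidate <= self.end:
            if candidate != self.gateway and candidate not in used:
                self.allocated[container_id] = candidate
                return f"{candidate}/{self.net.prefixlen}"
            candidate += 1
        raise RuntimeError("no free addresses in range")

    def release(self, container_id: str) -> None:
        self.allocated.pop(container_id, None)


ipam = HostLocalSketch("10.15.10.0/24", range_start="10.15.10.10", range_end="10.15.10.11")
ip_a = ipam.allocate("container-a")
ip_b = ipam.allocate("container-b")
ipam.release("container-a")
ip_c = ipam.allocate("container-c")
print(ipam.gateway, ip_a, ip_b, ip_c)
```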

2.3. Example configuration file

A configuration with a "plugins" list is a network configuration list, so the file below uses the .conflist suffix. JSON does not allow comments, so the notes are given here instead: "bridge" and "keyA" stand for plugin-specific parameters, and "subnet", "gateway", and "routes" are ipam-specific parameters.

$ mkdir -p /etc/cni/net.d
$ cat >/etc/cni/net.d/10-mynet.conflist <<EOF
{
  "cniVersion": "1.0.0",
  "name": "dbnet",
  "plugins": [
    {
      "type": "bridge",
      "bridge": "cni0",
      "keyA": ["some more", "plugin specific", "configuration"],
      "ipam": {
        "type": "host-local",
        "subnet": "10.1.0.0/16",
        "gateway": "10.1.0.1",
        "routes": [
            {"dst": "0.0.0.0/0"}
        ]
      },
      "dns": {
        "nameservers": [ "10.1.0.1" ]
      }
    },
    {
      "type": "tuning",
      "capabilities": {
        "mac": true
      },
      "sysctl": {
        "net.core.somaxconn": "500"
      }
    },
    {
        "type": "portmap",
        "capabilities": {"portMappings": true}
    }
  ]
}
EOF

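3. CNI plugins

3.1. Installing the plugins

Install the CNI plugin binaries. Download page: https://github.com/containernetworking/plugins/releases

# Download the binaries
wget https://github.com/containernetworking/plugins/releases/download/v1.1.0/cni-plugins-linux-amd64-v1.1.0.tgz

# Create the target directory if needed, then extract the archive into it
mkdir -p /opt/cni/bin
tar -zvxf cni-plugins-linux-amd64-v1.1.0.tgz -C /opt/cni/bin/

# List the extracted files
# ll -h /opt/cni/bin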
总用量 63M
-rwxr-xr-x 1 root root 3.7M 2月  24 01:01 bandwidth
-rwxr-xr-x 1 root root 4.1M 2月  24 01:01 bridge
-rwxr-xr-x 1 root root 9.3M 2月  24 01:01 dhcp
-rwxr-xr-x 1 root root 4.2M 2月  24 01:01 firewall
-rwxr-xr-x 1 root root 3.7M 2月  24 01:01 host-device
-rwxr-xr-x 1 root root 3.1M 2月  24 01:01 host-local
-rwxr-xr-x 1 root root 3.8M 2月  24 01:01 ipvlan
-rwxr-xr-x 1 root root 3.2M 2月  24 01:01 loopback
-rwxr-xr-x 1 root root 3.8M 2月  24 01:01 macvlan
-rwxr-xr-x 1 root root 3.6M 2月  24 01:01 portmap
-rwxr-xr-x 1 root root 4.0M 2月  24 01:01 ptp
-rwxr-xr-x 1 root root 3.4M 2月  24 01:01 sbr
-rwxr-xr-x 1 root root 2.7M 2月  24 01:01 static
-rwxr-xr-x 1 root root 3.3M 2月  24 01:01 tuning
-rwxr-xr-x 1 root root 3.8M 2月  24 01:01 vlan
-rwxr-xr-x 1 root root 3.4M 2月  24 01:01 vrf

3.2. Plugin categories

Reference: https://www.cni.dev/plugins/current/

Category | Plugin | Description
main | bridge | Creates a bridge, adds the host and the container to it
main | ipvlan | Adds an ipvlan interface in the container
main | macvlan | Creates a new MAC address, forwards all traffic to that to the container
main | ptp | Creates a veth pair
main | host-device | Moves an already-existing device into a container
main | vlan | Creates a vlan interface off a master
IPAM | dhcp | Runs a daemon on the host to make DHCP requests on behalf of a container
IPAM | host-local | Maintains a local database of allocated IPs
IPAM | static | Allocates static IPv4/IPv6 addresses to containers
meta | tuning | Changes sysctl parameters of an existing interface
meta | portmap | An iptables-based portmapping plugin. Maps ports from the host’s address space to the container
meta | bandwidth | Allows bandwidth-limiting through use of traffic control tbf (ingress/egress)
meta | sbr | A plugin that configures source based routing for an interface (from which it is chained)
meta | firewall | A firewall plugin which uses iptables or firewalld to add rules to allow traffic to/from the container

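4. CNI plugin operations

For details, see: https://github.com/containernetworking/cni/blob/master/SPEC.md#cni-operations

CNI defines the following operations:

  • ADD: adds the container to the network; called when the container starts.
  • DEL: removes the container from the network; called when the container is deleted.
  • CHECK: checks that the container's network is healthy.
  • VERSION: reports the plugin version.

The operation is passed to the CNI plugin binary through the CNI_COMMAND environment variable.

The environment variables are:

  • CNI_COMMAND: the operation: ADD, DEL, CHECK, or VERSION

  • CNI_CONTAINERID: the container ID, assigned by the runtime; must not be empty.

  • CNI_NETNS: the path to the container's network namespace, for example /run/netns/[nsname]

  • CNI_IFNAME: the name of the interface inside the container.

  • CNI_ARGS: extra arguments.

  • CNI_PATH: the path(s) to search for CNI plugin binaries.

4.1. ADD: add the container to the network

Creates the interface CNI_IFNAME inside the container's network namespace CNI_NETNS, or adjusts the configuration of that interface.

Required parameters:

  • CNI_COMMAND
  • CNI_CONTAINERID
  • CNI_NETNS
  • CNI_IFNAME

Optional parameters:

  • CNI_ARGS
  • CNI_PATH

4.2. DEL: remove the container from the network

Deletes the container interface CNI_IFNAME from the container's network namespace CNI_NETNS, or undoes the changes made by ADD.

Required parameters:

  • CNI_COMMAND
  • CNI_CONTAINERID
  • CNI_IFNAME

Optional parameters:

  • CNI_NETNS
  • CNI_ARGS
  • CNI_PATH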
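The way a plugin receives its operation through environment variables can be sketched as follows. This is a simplified dispatcher, not a working plugin: `handle_cni` is a hypothetical function that takes the environment and the stdin text as arguments (a real plugin would read os.environ and sys.stdin), it covers only ADD, DEL, and VERSION with the required variables listed above, and the result dictionaries are reduced to a few fields.

```python
import json

# Required environment variables per operation, as listed above.
REQUIRED = {
    "ADD": ("CNI_CONTAINERID", "CNI_NETNS", "CNI_IFNAME"),
    "DEL": ("CNI_CONTAINERID", "CNI_IFNAME"),
}


def handle_cni(env: dict, stdin_text: str) -> dict:
    """Dispatch on CNI_COMMAND and return the result to print as JSON."""
    command = env.get("CNI_COMMAND")
    if command == "VERSION":
        return {"cniVersion": "1.0.0", "supportedVersions": ["1.0.0"]}
    if command not in REQUIRED:
        raise ValueError(f"unsupported CNI_COMMAND: {command!r}")
    missing = [name for name in REQUIRED[command] if not env.get(name)]
    if missing:
        raise ValueError(f"missing required variables: {missing}")
    conf = json.loads(stdin_text)  # the network configuration arrives on stdin
    if command == "ADD":
        # A real plugin would create CNI_IFNAME inside CNI_NETNS here.
        return {
            "cniVersion": conf["cniVersion"],
            "interfaces": [{"name": env["CNI_IFNAME"], "sandbox": env["CNI_NETNS"]}],
        }
    return {}  # DEL: a real plugin would remove the interface here


conf_text = '{"cniVersion": "1.0.0", "name": "dbnet", "type": "bridge"}'
add_env = {
    "CNI_COMMAND": "ADD",
    "CNI_CONTAINERID": "abc123",
    "CNI_NETNS": "/run/netns/demo",
    "CNI_IFNAME": "eth0",
}
add_result = handle_cni(add_env, conf_text)
print(json.dumps(add_result))
```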

4.3. CHECK: check the container network

4.4. VERSION: print the CNI version


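5.3.2 - Introduction to Macvlan

1. Overview

A macvlan can be thought of as a sub-interface of a physical interface eth (the parent interface). Each macvlan has its own MAC address and can be given an IP and used like a normal network interface. This makes it possible to bind several IPs to one physical network device, each with its own MAC address. The feature is widely used in container virtualization: a container can be configured with a macvlan network by moving the macvlan interface into the container's namespace.

[Figure: macvlan diagram]

2. The four working modes

2.1. VEPA (Virtual Ethernet Port Aggregator)

VEPA is the default mode. In this mode, all traffic sent by a macvlan goes out through the parent interface, whether or not the destination shares the same parent interface.

2.2. Bridge mode

Bridge mode resembles a traditional bridge: macvlans that share a parent interface can talk to each other directly, without sending the data to the parent interface to be forwarded. It does not require the switch to support hairpin mode, and it performs better than VEPA. Compared with a traditional bridge, it does not need MAC address learning or STP, so it is also considerably faster than a traditional bridge. However, if the parent interface goes down, all sub-interfaces go down too and can no longer communicate.

2.3. Private mode

This mode is an enhanced version of VEPA mode.

2.4. Passthru mode

To be completed.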


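5.4 - Network plugins

5.4.1 - Introduction to Flannel

1. What flannel is (what)

1.1. Overview

Flannel is a network planning service designed by the CoreOS team for Kubernetes. Put simply, it gives the Docker containers created on different node hosts in a cluster virtual IP addresses that are unique across the whole cluster. Flannel project page: https://github.com/coreos/flannel

1.2. Background

1. Overlay network

A network that runs on top of another network (an application-layer network). It does not rely on IP addresses alone to deliver messages; instead it uses a mapping between IP addresses and identifiers to locate resources.

2. Routing

The Internet is a collection of networks connected by routers. Routers use routing tables, routing protocols, and related mechanisms to forward packets correctly so that they reach the destination host. A router compares the destination IP address in a packet with its routing table to determine the next router that should receive the data.

1) Static routing: the routing table entries on routers and hosts are configured in advance.

[Figure: static routing]

2) Dynamic routing: a routing protocol updates and sets routing table entries automatically at run time.

[Figure: dynamic routing]

2. Why use flannel (why)

In the default Docker configuration, the Docker service on each node is responsible for allocating IPs to the containers on that node. One resulting problem is that containers on different nodes may be given the same internal IP address.

Flannel is designed to re-plan how IP addresses are used across all nodes in the cluster, so that containers on different nodes get IP addresses that "belong to the same internal network" and are "not duplicated", and so that containers on different nodes can communicate directly over their internal IPs.

3. How flannel works (how)

Flannel is essentially an overlay network: it wraps the original packet inside another network packet for routing and delivery. It supports several forwarding backends, including UDP, VxLAN, AWS VPC, and GCE routes. In the version described here, the default way of moving data between nodes is UDP forwarding.

3.1. Flannel architecture diagram

[Figure: flannel]

  1. After data leaves the source container, the docker0 virtual interface on its host forwards it to the flannel0 virtual interface. flannel0 is a point-to-point virtual interface, and the flanneld service listens on its other end.
  2. Flannel maintains a routing table between nodes through the Etcd service.
  3. The flanneld service on the source host wraps the original data in UDP and delivers it, according to its routing table, to the flanneld service on the destination node. When the data arrives it is unpacked and goes straight into the flannel0 virtual interface on the destination node, is forwarded to the docker0 virtual interface on the destination host, and is finally routed by docker0 to the target container, just like local container traffic.

3.2. Implementation notes

1. UDP encapsulation

The original data is UDP-encapsulated by the Flannel service on the source node. After delivery to the destination node, the Flannel service on the other side restores the original packet. The Docker services on both sides are unaware of this process. In the captured example, the UDP payload is itself another packet, an ICMP packet (that is, one produced by the ping command).

[Figure: UDP encapsulation]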
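The idea of carrying one packet inside the payload of another can be shown with a minimal sketch. It only builds and parses an 8-byte UDP header around an opaque inner packet; it is not flannel's wire format, the checksum is left at zero, and port 8285 is used here as an assumed value for illustration.

```python
import struct

UDP_HEADER = struct.Struct("!HHHH")  # source port, destination port, length, checksum


def encapsulate(inner_packet: bytes, src_port: int, dst_port: int) -> bytes:
    """Wrap the original packet as the payload of a UDP datagram."""
    length = UDP_HEADER.size + len(inner_packet)
    header = UDP_HEADER.pack(src_port, dst_port, length, 0)  # checksum not computed
    return header + inner_packet


def decapsulate(datagram: bytes) -> bytes:
    """Strip the UDP header and return the original packet unchanged."""
    _src, _dst, length, _checksum = UDP_HEADER.unpack(datagram[:UDP_HEADER.size])
    return datagram[UDP_HEADER.size:length]


inner = b"\x45\x00 pretend this is an ICMP echo request"
datagram = encapsulate(inner, src_port=8285, dst_port=8285)
restored = decapsulate(datagram)
print(len(inner), len(datagram), restored == inner)
```

The container at either end only ever sees `inner`; the outer header exists just for the hop between the two hosts.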

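2. Assigning a different IP range to each Docker daemon

After Flannel allocates each node's usable IP range through Etcd, it quietly modifies Docker's startup parameters.

[Figure: Docker startup parameters]

Note the "--bip=172.17.18.1/24" parameter, which restricts the IP range available to containers on that node.

This IP range is assigned automatically by Flannel, which uses the records stored in Etcd to make sure ranges never overlap.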
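How non-overlapping per-node ranges can be carved out of one cluster network is sketched below. The in-memory `leases` dict stands in for the records Flannel keeps in Etcd, `SubnetLeases` is a hypothetical class, and handing out subnets sequentially is a simplification for the example.

```python
import ipaddress


class SubnetLeases:
    """Carves per-node /24 subnets out of the cluster network."""

    def __init__(self, network: str = "172.17.0.0/16", subnet_len: int = 24):
        self._pool = ipaddress.ip_network(network).subnets(new_prefix=subnet_len)
        self.leases = {}  # node name -> subnet (stand-in for the etcd records)

    def acquire(self, node: str):
        """Return the node's subnet, leasing a new one on first use."""
        if node not in self.leases:
            self.leases[node] = next(self._pool)
        return self.leases[node]

    def bip(self, node: str) -> str:
        """Value for Docker's --bip flag: first address of the node's subnet."""
        subnet = self.acquire(node)
        return f"{subnet.network_address + 1}/{subnet.prefixlen}"


leases = SubnetLeases()
subnet_a = leases.acquire("node-a")
subnet_b = leases.acquire("node-b")
print(subnet_a, subnet_b, leases.bip("node-b"))
```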

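3. Routing rules

1) Routing table on the sending node

[Figure: routing table on the sending node]

2) Routing table on the receiving node

[Figure: routing table on the receiving node]

For example, suppose a packet is sent from the container with IP 172.17.18.2 to the container with IP 172.17.46.2. In the sending node's routing table it matches only the 172.17.0.0/16 entry, so after leaving docker0 the data is delivered to flannel0. On the destination node, the destination address belongs to a local container, so it falls within the 172.17.46.0/24 entry for docker0 and is delivered to the docker0 interface.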
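The lookup described in this example is a longest-prefix match, which can be reproduced in a few lines. The two route tables below are reconstructed from the addresses mentioned in the text and are assumptions as far as the full contents of the real tables go.

```python
import ipaddress

SENDER_ROUTES = [("172.17.0.0/16", "flannel0"), ("172.17.18.0/24", "docker0")]
RECEIVER_ROUTES = [("172.17.0.0/16", "flannel0"), ("172.17.46.0/24", "docker0")]


def lookup(dst_ip: str, routes: list):
    """Return the device of the most specific route that contains dst_ip."""
    addr = ipaddress.ip_address(dst_ip)
    matches = [
        (ipaddress.ip_network(cidr), device)
        for cidr, device in routes
        if addr in ipaddress.ip_network(cidr)
    ]
    if not matches:
        return None
    # Longest prefix wins: /24 is preferred over /16 when both match.
    return max(matches, key=lambda match: match[0].prefixlen)[1]


print(lookup("172.17.46.2", SENDER_ROUTES))
print(lookup("172.17.46.2", RECEIVER_ROUTES))
```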

3.3. Installing and configuring flannel

1. Installation

wget http://<官网>/flannel/flannel-0.2.0-10.el7.x86_64.rpm
yum localinstall -y flannel-0.2.0-10.el7.x86_64.rpm

2. Configuration

vi /etc/sysconfig/flanneld

# Flanneld configuration options
 
# etcd url location. Point this to the server where etcd runs
FLANNEL_ETCD="http://127.0.0.1:4001"
  
# etcd config key. This is the configuration key that flannel queries
# For address range assignment
FLANNEL_ETCD_KEY="/xxx/flannel/product/network"
  
# Any additional options that you want to pass
FLANNEL_OPTIONS=" -iface=eth0"

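3. Initialize flannel's configuration in etcd

The key must be the FLANNEL_ETCD_KEY value from /etc/sysconfig/flanneld followed by /config:

etcdctl set /xxx/flannel/product/network/config '{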
   "Network": "10.0.0.0/16",
   "Backend": {
       "Type": "vxlan"
   }
}'

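6 - Container storage

6.1 - Volume concepts

6.1.1 - Volume

1. Volume overview

  • Files in a container live only as long as the container. When a container crashes and is restarted, it starts again from the filesystem contents of the original image, and files produced while the previous container was running are lost.
  • A Pod's volume lives as long as the Pod. The volume is removed only when the Pod is deleted, so files are preserved when a container in the Pod restarts.

A process in a container sees a filesystem view composed of its Docker image and its volumes.

How a Pod uses volumes

Each container in a Pod must independently specify where each volume is mounted, and the Pod must be configured with the volume-related fields.

The key volume fields of a Pod are:

  • spec.volumes: which volumes are provided
  • spec.containers.volumeMounts: which path in the container each volume is mounted at

2. Volume types

2.1. emptyDir

1. Characteristics

  • A directory is created for the emptyDir and is empty by default (any files previously in that directory are cleared).
  • Different containers in the Pod can read and write the same files in the directory (that is, containers in a Pod can share files this way).
  • When the Pod is deleted, the data in the emptyDir is deleted permanently. If only a container in the Pod crashes, the data is kept.

2. Use cases

  • Sharing files between containers (for example, log collection)
  • Scratch space, for example for a disk-based merge sort
  • Checkpoints for recovering a long-running computation after a crash

3. Example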

apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
  - image: k8s.gcr.io/test-webserver
    name: test-container
    volumeMounts:
    - mountPath: /cache
      name: cache-volume
  volumes:
  - name: cache-volume
    emptyDir: {}

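2.2. hostPath

1. Characteristics

  • Mounts a directory or file from the host into the Pod.

2. Use cases

  • Running a container that needs access to Docker internals: use a hostPath of /var/lib/docker.

  • Running cAdvisor in a container: use a hostPath of /dev/cgroups.

  • Other scenarios that need files from the host.

The type field of hostPath

Value | Behavior
"" (empty string, default) | For backward compatibility: no check is performed before the hostPath volume is mounted.
DirectoryOrCreate | If nothing exists at the given path, an empty directory is created there as needed, with permissions 0755 and the same group and ownership as Kubelet.
Directory | A directory must exist at the given path.
FileOrCreate | If nothing exists at the given path, an empty file is created there as needed, with permissions 0644 and the same group and ownership as Kubelet.
File | A file must exist at the given path.
Socket | A UNIX socket must exist at the given path.
CharDevice | A character device must exist at the given path.
BlockDevice | A block device must exist at the given path.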
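The checks implied by each type value can be approximated with the standard library. This is a sketch of the table's behavior, not Kubelet's implementation: `check_host_path` is a hypothetical helper, it ignores ownership, and the demo runs against a temporary directory rather than a real host path.

```python
import os
import stat
import tempfile

TYPE_CHECKS = {
    "Directory": stat.S_ISDIR,
    "File": stat.S_ISREG,
    "Socket": stat.S_ISSOCK,
    "CharDevice": stat.S_ISCHR,
    "BlockDevice": stat.S_ISBLK,
}


def check_host_path(path: str, type_: str = "") -> bool:
    """Return True when the path satisfies the given hostPath type."""
    if type_ == "":
        return True  # default: no check before mounting
    if type_ == "DirectoryOrCreate":
        if not os.path.exists(path):
            os.makedirs(path, mode=0o755)
        return os.path.isdir(path)
    if type_ == "FileOrCreate":
        if not os.path.exists(path):
            os.close(os.open(path, os.O_CREAT | os.O_WRONLY, 0o644))
        return os.path.isfile(path)
    if not os.path.exists(path):
        return False
    return TYPE_CHECKS[type_](os.stat(path).st_mode)


with tempfile.TemporaryDirectory() as root:
    target = os.path.join(root, "data")
    results = {
        "unchecked": check_host_path(target),
        "directory_missing": check_host_path(target, "Directory"),
        "directory_created": check_host_path(target, "DirectoryOrCreate"),
        "directory_present": check_host_path(target, "Directory"),
        "file_on_directory": check_host_path(target, "File"),
    }

print(results)
```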

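Notes

  • Because the files on each node differ, Pods with identical configuration may behave differently on different nodes.
  • When Kubernetes adds resource-aware scheduling as planned, it will not be able to account for resources used by hostPath.
  • Files or directories created on the underlying host are writable only by root. You need to run the process as root in a privileged container, or change the file permissions on the host, in order to write to a hostPath.

3. Example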

apiVersion: v1
kind: Pod
metadata:
  name: test-pd
spec:
  containers:
  - image: k8s.gcr.io/test-webserver
    name: test-container
    volumeMounts:
    - mountPath: /test-pd
      name: test-volume
  volumes:
  - name: test-volume
    hostPath:
      # directory location on host
      path: /data
      # this field is optional
      type: Directory

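2.3. configMap

configMap provides a way to inject configuration files into a Pod. The file contents are stored in a configMap object. If a Pod uses configMap as a volume type, the configMap object must be created first.

Example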

apiVersion: v1
kind: Pod
metadata:
  name: configmap-pod
spec:
  containers:
    - name: test
      image: busybox
      volumeMounts:
        - name: config-vol
          mountPath: /etc/config
  volumes:
    - name: config-vol
      configMap:
        name: log-config
        items:
          - key: log_level
            path: log_level

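2.4. cephfs

The cephfs type mounts the Pod's storage onto a ceph cluster, persisting the Pod's data on external storage (the data remains in the ceph cluster after the Pod is deleted). A ceph cluster must be deployed and maintained beforehand.

Example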

apiVersion: v1
kind: Pod
metadata:
  name: cephfs
spec:
  containers:
  - name: cephfs-rw
    image: kubernetes/pause
    volumeMounts:
    - mountPath: "/mnt/cephfs"
      name: cephfs
  volumes:
  - name: cephfs
    cephfs:
      monitors:
      - 10.16.154.78:6789
      - 10.16.154.82:6789
      - 10.16.154.83:6789
      # by default the path is /, but you can override and mount a specific path of the filesystem by using the path attribute
      # path: /some/path/in/side/cephfs 
      user: admin
      secretFile: "/etc/ceph/admin.secret"
      readOnly: true

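For more, see the CephFS example.

2.5. nfs

The nfs type is similar to cephfs: the Pod's data is stored on an NFS cluster. See the NFS example for details.

2.6. persistentVolumeClaim

A persistentVolumeClaim volume mounts a PersistentVolume into a container. PersistentVolumes are a way for users to "claim" durable storage (such as a GCE PersistentDisk or an iSCSI volume) without knowing the details of the particular cloud environment.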


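6.1.2 - PersistentVolume

1. PV overview

A PersistentVolume (PV) is a volume plugin like Volume and is also a resource in the cluster. Its lifecycle is independent of any Pod (it is not deleted when a Pod is deleted), and it does not belong to any Namespace.

2. Lifecycle of PVs and PVCs

2.1. Provisioning

A PV can be provisioned in two ways: statically or dynamically.

1. Static

A PV is created manually and is then available for objects in the k8s cluster to consume.

2. Dynamic

PVs are created and deleted dynamically through a StorageClass and a concrete Provisioner (for example, nfs-client-provisioner).

2.2. Binding

With dynamic provisioning, the user creates a specific PVC; k8s watches for new PVCs and looks for a matching PV to bind. Once bound, the binding is exclusive: a PVC and a PV map one-to-one.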
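A rough sketch of what "looking for a matching PV" could involve is shown below. The matching criteria (same storage class, requested access modes supported, enough capacity, smallest fit first) are a simplified reading of the binding rules, and the `PV` and `PVC` dataclasses are stand-ins for the real API objects.

```python
from dataclasses import dataclass
from typing import List, Optional

UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as '8Gi' to bytes (binary suffixes only)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


@dataclass
class PV:
    name: str
    capacity: str
    access_modes: List[str]
    storage_class: str = ""
    claimed_by: Optional[str] = None


@dataclass
class PVC:
    name: str
    request: str
    access_modes: List[str]
    storage_class: str = ""


def bind(pvc: PVC, pvs: List[PV]) -> Optional[PV]:
    """Bind the claim to the smallest unbound PV that satisfies it."""
    candidates = [
        pv for pv in pvs
        if pv.claimed_by is None
        and pv.storage_class == pvc.storage_class
        and set(pvc.access_modes) <= set(pv.access_modes)
        and parse_quantity(pv.capacity) >= parse_quantity(pvc.request)
    ]
    if not candidates:
        return None
    best = min(candidates, key=lambda pv: parse_quantity(pv.capacity))
    best.claimed_by = pvc.name  # exclusive, one-to-one binding
    return best


volumes = [
    PV("pv-big", "10Gi", ["ReadWriteOnce"], "slow"),
    PV("pv-small", "8Gi", ["ReadWriteOnce", "ReadOnlyMany"], "slow"),
]
first_bound = bind(PVC("myclaim", "8Gi", ["ReadWriteOnce"], "slow"), volumes)
second_bound = bind(PVC("other", "8Gi", ["ReadWriteOnce"], "slow"), volumes)
third_bound = bind(PVC("late", "8Gi", ["ReadWriteOnce"], "slow"), volumes)
print(first_bound.name, second_bound.name, third_bound)
```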

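2.3. Using

A Pod uses a PVC as a volume. The cluster inspects the PVC to find the bound volume and mounts that volume for the Pod. Users schedule Pods and access their claimed PVs by including a persistentVolumeClaim in the Pod's volume configuration.

2.4. Reclaiming

A PV's reclaim policy determines what happens to the underlying Volume after its PVC is released. There are three policies: Retain, Recycle, and Delete.

1. Retain

The Retain policy allows manual reclamation. When the PVC is deleted, the PV still exists, and the volume can be reclaimed with the following steps:

  1. Delete the PV.
  2. Manually clean up the data on the external storage.
  3. Manually delete the associated storage asset, or reuse it.

2. Recycle

This policy is deprecated; dynamic provisioning is recommended instead.

The Recycle policy performs a basic scrub (rm -rf /thevolume/*) on the volume, after which it can be claimed again.

3. Delete

With the Delete policy, a delete operation removes the PV object from the k8s cluster and also deletes the external storage resource (the exact behavior depends on the provisioner; some rename instead of deleting).

Dynamically provisioned volumes inherit the reclaim policy of their StorageClass, which defaults to Delete. When the user deletes the PVC, the PV's Delete policy is applied automatically.

To change a PV's reclaim policy, run:

# Get pv
kubectl get pv
# Change policy to Retain
kubectl patch pv <pv_name> -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'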

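3. PV types

PersistentVolume types are implemented as plugins. Only some common types are listed here:

  • GCEPersistentDisk
  • AWSElasticBlockStore
  • NFS
  • RBD (Ceph Block Device)
  • CephFS
  • Glusterfs

4. PV attributes

Every PV contains a spec field (its specification) and a status field (the state of the volume).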

apiVersion: v1
kind: PersistentVolume
metadata:
  annotations:
    pv.kubernetes.io/provisioned-by: fuseim.pri/ifs
  creationTimestamp: 2018-07-12T06:46:48Z
  name: default-test-web-0-pvc-58cf5ec1-859f-11e8-bb61-005056b83985
  resourceVersion: "100163256"
  selfLink: /api/v1/persistentvolumes/default-test-web-0-pvc-58cf5ec1-859f-11e8-bb61-005056b83985
  uid: 59796ba3-859f-11e8-9c50-c81f66bcff65
spec:
  accessModes:
  - ReadWriteOnce
  capacity:
    storage: 2Gi
  volumeMode: Filesystem  
  claimRef:
    apiVersion: v1
    kind: PersistentVolumeClaim
    name: test-web-0
    namespace: default
    resourceVersion: "100163248"
    uid: 58cf5ec1-859f-11e8-bb61-005056b83985
  nfs:
    path: /data/nfs-storage/default-test-web-0-pvc-58cf5ec1-859f-11e8-bb61-005056b83985
    server: 172.16.201.54
  persistentVolumeReclaimPolicy: Delete
  storageClassName: managed-nfs-storage
  mountOptions:
    - hard
    - nfsvers=4.1
status:
  phase: Bound

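4.1. Capacity

Sets a specific storage capacity for the PV. For more on capacity, see the Kubernetes resource model.

4.2. Volume Mode

Valid values for volumeMode are Filesystem and Block. If unspecified, volumeMode defaults to Filesystem.

4.3. Access Modes

The access modes are:

  • ReadWriteOnce: the volume can be mounted read-write by a single node
  • ReadOnlyMany: the volume can be mounted read-only by many nodes
  • ReadWriteMany: the volume can be mounted read-write by many nodes

In the CLI, the access modes are abbreviated as:

  • RWO - ReadWriteOnce
  • ROX - ReadOnlyMany
  • RWX - ReadWriteMany

A volume can only be mounted with one access mode at a time, even if it supports several.

Only some common plugins are listed below:

Volume plugin | ReadWriteOnce | ReadOnlyMany | ReadWriteMany
AWSElasticBlockStore | ✓ | - | -
CephFS | ✓ | ✓ | ✓
GCEPersistentDisk | ✓ | ✓ | -
Glusterfs | ✓ | ✓ | ✓
HostPath | ✓ | - | -
NFS | ✓ | ✓ | ✓
RBD | ✓ | ✓ | -

4.4. Class

A PV can specify a StorageClass through the storageClassName attribute, which is used when binding PVs and PVCs dynamically. A PV without this attribute can only be bound to PVCs that do not request a particular class.

4.5. Reclaim Policy

The reclaim policies are:

  • Retain: manual reclamation
  • Recycle: basic scrub (rm -rf /thevolume/*)
  • Delete: the associated storage asset (for example an AWS EBS, GCE PD, Azure Disk, or OpenStack Cinder volume) is deleted

Currently, only NFS and HostPath support the Recycle policy. AWS EBS, GCE PD, Azure Disk, and Cinder volumes support the Delete policy.

4.6. Mount Options

A Kubernetes administrator can specify mount options to use when a persistent volume is mounted on a node.

Note: not all persistent volume types support mount options.

Common types that support mount options: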

  • GCEPersistentDisk
  • AWSElasticBlockStore
  • AzureFile
  • AzureDisk
  • NFS
  • RBD (Ceph Block Device)
  • CephFS
  • Cinder (OpenStack volume storage)
  • Glusterfs

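4.7. Phase

A PV is in one of the following phases:

  • Available: a free resource not yet bound to any claim
  • Bound: the volume is bound to a claim
  • Released: the claim has been deleted, but the resource has not yet been reclaimed by the cluster
  • Failed: automatic reclamation of the volume failed

The CLI shows the name of the PVC bound to each PV.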


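6.1.3 - PersistentVolumeClaim

1. PVC overview

A PersistentVolumeClaim (PVC) is a user's request for storage. A PVC consumes PV resources and can request a specific size and access mode. It must belong to a Namespace, and only Pods in the same Namespace can reference it.

When PVs with different characteristics are needed to meet storage requirements, use StorageClass.

Every PVC contains a spec field (its specification) and a status field (the state of the claim).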

kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: myclaim
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 8Gi
  storageClassName: slow
  selector:
    matchLabels:
      release: "stable"
    matchExpressions:
      - {key: environment, operator: In, values: [dev]}

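2. PVC attributes

2.1. accessModes

The access mode of the requested storage, for example ReadWriteOnce.

2.2. volumeMode

The volume mode of the requested storage, for example Filesystem.

2.3. resources

A claim can request a specific quantity of a resource. The same resource model applies to both Volumes and PVCs.

2.4. selector

A claim can declare a label selector; only volumes whose labels match the selector can be bound to the claim.

  • matchLabels: the volume must have a label with this value
  • matchExpressions: a list of conditions used to filter matching volumes. Valid operators are In, NotIn, Exists, and DoesNotExist.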
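The selector semantics can be sketched as a small predicate. `selector_matches` is a hypothetical helper written for illustration; it treats all matchLabels and matchExpressions entries as conditions that must hold together.

```python
def selector_matches(labels: dict, selector: dict) -> bool:
    """Return True when a volume's labels satisfy the claim's selector."""
    for key, value in selector.get("matchLabels", {}).items():
        if labels.get(key) != value:
            return False
    for expr in selector.get("matchExpressions", []):
        key, operator = expr["key"], expr["operator"]
        values = expr.get("values", [])
        if operator == "In":
            ok = key in labels and labels[key] in values
        elif operator == "NotIn":
            ok = labels.get(key) not in values
        elif operator == "Exists":
            ok = key in labels
        elif operator == "DoesNotExist":
            ok = key not in labels
        else:
            raise ValueError(f"unknown operator: {operator}")
        if not ok:
            return False
    return True


# The selector from the PVC example in this section.
selector = {
    "matchLabels": {"release": "stable"},
    "matchExpressions": [{"key": "environment", "operator": "In", "values": ["dev"]}],
}
print(selector_matches({"release": "stable", "environment": "dev"}, selector))
print(selector_matches({"release": "stable", "environment": "prod"}, selector))
```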

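2.5. storageClassName

The storageClassName parameter names the StorageClass to use. Only a PV whose storageClassName is the same as the class requested by the PVC can be bound to that PVC.

A PVC may omit storageClassName or set it to an empty value. If the admission controller plugin is enabled and a default StorageClass is specified, the PVC uses the default StorageClass; otherwise it is bound to a PV that has no StorageClass.

Previously the annotation volume.beta.kubernetes.io/storage-class was used instead of the storageClassName attribute. The annotation still works, but it will not be supported in a future Kubernetes release.

3. Using a PVC as a Volume

When a PVC is used as a Pod's Volume, the PVC and the Pod must be in the same namespace. The Pod is declared as follows: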

kind: Pod
apiVersion: v1
metadata:
  name: mypod
spec:
  containers:
    - name: myfrontend
      image: dockerfile/nginx
      volumeMounts:
      - mountPath: "/var/www/html"
        name: mypd
  volumes:
    - name: mypd
      persistentVolumeClaim:    # 使用PVC
        claimName: myclaim

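PersistentVolume bindings are exclusive, and because PersistentVolumeClaims are namespaced objects, a claim with a "Many" mode (ROX, RWX) can only be mounted within one namespace.

6.1.4 - StorageClass

1. StorageClass overview

A StorageClass provides a way to describe classes of storage. Different classes may map to different quality-of-service levels, backup policies, or other policies.

A StorageClass object contains the provisioner, parameters, and reclaimPolicy fields, which are used when a PersistentVolume needs to be provisioned dynamically. The name and other parameters are set when the StorageClass object is created, and the object cannot be updated once it exists. A default StorageClass can also be specified for PVCs that do not request a particular class.

A StorageClass object file: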

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: standard
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp2
reclaimPolicy: Retain
mountOptions:
  - debug

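2. StorageClass attributes

2.1. Provisioner

A storage class has a provisioner that determines which volume plugin is used to provision PVs. This field is required. Either an internal or an external provisioner can be specified. The code for external provisioners lives at kubernetes-incubator/external-storage and includes NFS, Ceph, and others.

2.2. Reclaim Policy

The reclaimPolicy field sets the reclaim policy of the Persistent Volumes that are created. The policy is either Delete or Retain; if unspecified, it defaults to Delete.

2.3. Mount Options

Persistent Volumes created dynamically by a storage class use the mount options given in the class's mountOptions field.

2.4. Parameters

A storage class has parameters that describe the volumes belonging to it. Different parameters are accepted depending on the provisioner. When a parameter is omitted, a default value is used.

For example, with Ceph RBD: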

kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: fast
provisioner: kubernetes.io/rbd
parameters:
  monitors: 10.16.153.105:6789
  adminId: kube
  adminSecretName: ceph-secret
  adminSecretNamespace: kube-system
  pool: kube
  userId: kube
  userSecretName: ceph-secret-user
  fsType: ext4
  imageFormat: "2"
  imageFeatures: "layering"

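Parameter descriptions:

  • monitors: Ceph monitors, comma-separated. This parameter is required.

  • adminId: Ceph client ID that is able to create images in the pool (ceph pool). Defaults to "admin".

  • adminSecretNamespace: the namespace of adminSecretName. Defaults to "default".

  • adminSecretName: the name of the Secret for adminId. This parameter is required. The provided secret must have type "kubernetes.io/rbd".

  • pool: Ceph RBD pool. Defaults to "rbd".

  • userId: Ceph client ID used to map the RBD image. Defaults to the same value as adminId.

  • userSecretName: the name of the Ceph Secret for userId, used to map the RBD image. It must exist in the same namespace as the PVC. This parameter is required. The provided secret must have type "kubernetes.io/rbd", for example created like this: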

    kubectl create secret generic ceph-secret --type="kubernetes.io/rbd" \
      --from-literal=key='QVFEQ1pMdFhPUnQrSmhBQUFYaERWNHJsZ3BsMmNjcDR6RFZST0E9PQ==' \
      --namespace=kube-system
    
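  • fsType: an fsType supported by Kubernetes. Default: "ext4".

  • imageFormat: Ceph RBD image format, "1" or "2". Defaults to "1".

  • imageFeatures: optional; it can only be used when imageFormat is set to "2". The only feature currently supported is layering. Defaults to "" (no features enabled).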


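6.1.5 - Dynamic Volume Provisioning

Dynamic volume provisioning lets storage volumes be created automatically on demand. Users do not have to deal with the complexity and differences of the underlying storage, yet they can still choose among storage types.

1. Enabling Dynamic Provisioning

A StorageClass object must be created in advance. The StorageClass defines which provisioner to use and which parameters to pass when the provisioner is called; see the StorageClass introduction for details.

For example:

  • Disk-backed storage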
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: slow
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-standard
  • SSD-backed storage
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast
provisioner: kubernetes.io/gce-pd
parameters:
  type: pd-ssd

2. Using Dynamic Provisioning

Create a PVC object and set its storageClassName field to the name of the StorageClass to use, for example:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: claim1
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: fast
  resources:
    requests:
      storage: 30Gi

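When the PVC is used, the corresponding external storage is created automatically. When the PVC is deleted, the external storage is destroyed (or backed up) automatically.

3. The default StorageClass

A default StorageClass can be set for PVCs that do not name a StorageClass:

Add the storageclass.kubernetes.io/is-default-class annotation to a StorageClass to mark it as the default. When a user creates a PersistentVolumeClaim without specifying storageClassName, the PVC's storageClassName is automatically set to the default StorageClass.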
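The defaulting rule can be sketched as follows. `resolve_storage_class` is a hypothetical helper operating on plain dictionaries shaped like the API objects; it assumes the annotation value "true" marks the default class, and it does not model what happens when several classes are marked as default.

```python
DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def resolve_storage_class(pvc_spec: dict, storage_classes: list):
    """Return the StorageClass name a PVC ends up with, or None."""
    if "storageClassName" in pvc_spec:
        # An explicit value always wins, including the empty string.
        return pvc_spec["storageClassName"]
    for sc in storage_classes:
        annotations = sc["metadata"].get("annotations", {})
        if annotations.get(DEFAULT_ANNOTATION) == "true":
            return sc["metadata"]["name"]
    return None


classes = [
    {"metadata": {"name": "slow"}},
    {"metadata": {"name": "fast", "annotations": {DEFAULT_ANNOTATION: "true"}}},
]
implicit = resolve_storage_class({"accessModes": ["ReadWriteOnce"]}, classes)
explicit = resolve_storage_class({"storageClassName": "slow"}, classes)
print(implicit, explicit)
```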


6.2 - CSI

6.2.1 - csi-cephfs-plugin

1. Building the CSI CephFS plugin

The CSI CephFS plugin provisions CephFS volumes and mounts them. Source code: https://github.com/ceph/ceph-csi

1.1. Building the binary

$ make cephfsplugin

1.2. Building the Docker image

$ make image-cephfsplugin

2. Configuration

2.1. Command-line arguments

Option | Default value | Description
--endpoint | unix://tmp/csi.sock | CSI endpoint, must be a UNIX socket
--drivername | csi-cephfsplugin | name of the driver (Kubernetes: provisioner field in StorageClass must correspond to this value)
--nodeid | empty | This node’s ID
--volumemounter | empty | default volume mounter. Available options are kernel and fuse. This is the mount method used if volume parameters don’t specify otherwise. If left unspecified, the driver will first probe for ceph-fuse in system’s path and will choose Ceph kernel client if probing failed.

2.2. Volume parameters

Parameter | Required | Description
monitors | yes | Comma separated list of Ceph monitors (e.g. 192.168.100.1:6789,192.168.100.2:6789,192.168.100.3:6789)
mounter | no | Mount method to be used for this volume. Available options are kernel for Ceph kernel client and fuse for Ceph FUSE driver. Defaults to “default mounter”, see command line arguments.
provisionVolume | yes | Mode of operation. BOOL value. If true, a new CephFS volume will be provisioned. If false, an existing CephFS will be used.
pool | for provisionVolume=true | Ceph pool into which the volume shall be created
rootPath | for provisionVolume=false | Root path of an existing CephFS volume
csiProvisionerSecretName, csiNodeStageSecretName | for Kubernetes | name of the Kubernetes Secret object containing Ceph client credentials. Both parameters should have the same value
csiProvisionerSecretNamespace, csiNodeStageSecretNamespace | for Kubernetes | namespaces of the above Secret objects

2.3. provisionVolume

2.3.1. Admin key authentication

When provisionVolume=true, the required admin credentials are:

  • adminID: ID of an admin client
  • adminKey: key of the admin client

2.3.2. User key authentication

When provisionVolume=false, the required user credentials are:

  • userID: ID of a user client
  • userKey: key of a user client


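6.2.2 - Deploying csi-cephfs

0. Notes

Kubernetes 1.11 or later is required, and the k8s cluster must allow privileged Pods: --allow-privileged must be set to true on both the apiserver and kubelet. The Docker daemon on each node must allow shared mounts.

Images used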

  • quay.io/k8scsi/csi-provisioner:v0.3.0
  • quay.io/k8scsi/csi-attacher:v0.3.0
  • quay.io/k8scsi/driver-registrar:v0.3.0
  • quay.io/cephcsi/cephfsplugin:v0.3.0

1. Deploying RBAC

Deploy the service accounts, cluster roles, and cluster role bindings. They can be shared by the RBD and CephFS CSI plugins, which have the same permissions.

$ kubectl create -f csi-attacher-rbac.yaml
$ kubectl create -f csi-provisioner-rbac.yaml
$ kubectl create -f csi-nodeplugin-rbac.yaml

1.1. csi-attacher-rbac.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: csi-attacher

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: external-attacher-runner
rules:
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["volumeattachments"]
    verbs: ["get", "list", "watch", "update"]

---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: csi-attacher-role
subjects:
  - kind: ServiceAccount
    name: csi-attacher
    namespace: default
roleRef:
  kind: ClusterRole
  name: external-attacher-runner
  apiGroup: rbac.authorization.k8s.io

1.2. csi-provisioner-rbac.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: csi-provisioner

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: external-provisioner-runner
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["list", "watch", "create", "update", "patch"]
    
---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: csi-provisioner-role
subjects:
  - kind: ServiceAccount
    name: csi-provisioner
    namespace: default
roleRef:
  kind: ClusterRole
  name: external-provisioner-runner
  apiGroup: rbac.authorization.k8s.io

1.3. csi-nodeplugin-rbac.yaml

apiVersion: v1
kind: ServiceAccount
metadata:
  name: csi-nodeplugin

---
kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: csi-nodeplugin
rules:
  - apiGroups: [""]
    resources: ["nodes"]
    verbs: ["get", "list", "update"]
  - apiGroups: [""]
    resources: ["namespaces"]
    verbs: ["get", "list"]
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["volumeattachments"]
    verbs: ["get", "list", "watch", "update"]

---
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: csi-nodeplugin
subjects:
  - kind: ServiceAccount
    name: csi-nodeplugin
    namespace: default
roleRef:
  kind: ClusterRole
  name: csi-nodeplugin
  apiGroup: rbac.authorization.k8s.io          

2. Deploying the CSI sidecar containers

Deploy external-attacher and external-provisioner as StatefulSets for CSI CephFS to use.

$ kubectl create -f csi-cephfsplugin-attacher.yaml
$ kubectl create -f csi-cephfsplugin-provisioner.yaml

2.1. csi-cephfsplugin-provisioner.yaml

kind: Service
apiVersion: v1
metadata:
  name: csi-cephfsplugin-provisioner
  labels:
    app: csi-cephfsplugin-provisioner
spec:
  selector:
    app: csi-cephfsplugin-provisioner
  ports:
    - name: dummy
      port: 12345

---
kind: StatefulSet
apiVersion: apps/v1beta1
metadata:
  name: csi-cephfsplugin-provisioner
spec:
  serviceName: "csi-cephfsplugin-provisioner"
  replicas: 1
  template:
    metadata:
      labels:
        app: csi-cephfsplugin-provisioner
    spec:
      serviceAccount: csi-provisioner
      containers:
        - name: csi-provisioner
          image: quay.io/k8scsi/csi-provisioner:v0.3.0
          args:
            - "--provisioner=csi-cephfsplugin"
            - "--csi-address=$(ADDRESS)"
            - "--v=5"
          env:
            - name: ADDRESS
              value: /var/lib/kubelet/plugins/csi-cephfsplugin/csi.sock
          imagePullPolicy: "IfNotPresent"
          volumeMounts:
            - name: socket-dir
              mountPath: /var/lib/kubelet/plugins/csi-cephfsplugin
      volumes:
        - name: socket-dir
          hostPath:
            path: /var/lib/kubelet/plugins/csi-cephfsplugin
            type: DirectoryOrCreate

2.2. csi-cephfsplugin-attacher.yaml

kind: Service
apiVersion: v1
metadata:
  name: csi-cephfsplugin-attacher
  labels:
    app: csi-cephfsplugin-attacher
spec:
  selector:
    app: csi-cephfsplugin-attacher
  ports:
    - name: dummy
      port: 12345

---
kind: StatefulSet
apiVersion: apps/v1beta1
metadata:
  name: csi-cephfsplugin-attacher
spec:
  serviceName: "csi-cephfsplugin-attacher"
  replicas: 1
  template:
    metadata:
      labels:
        app: csi-cephfsplugin-attacher
    spec:
      serviceAccount: csi-attacher
      containers:
        - name: csi-cephfsplugin-attacher
          image: quay.io/k8scsi/csi-attacher:v0.3.0
          args:
            - "--v=5"
            - "--csi-address=$(ADDRESS)"
          env:
            - name: ADDRESS
              value: /var/lib/kubelet/plugins/csi-cephfsplugin/csi.sock
          imagePullPolicy: "IfNotPresent"
          volumeMounts:
            - name: socket-dir
              mountPath: /var/lib/kubelet/plugins/csi-cephfsplugin
      volumes:
        - name: socket-dir
          hostPath:
            path: /var/lib/kubelet/plugins/csi-cephfsplugin
            type: DirectoryOrCreate

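3. Deploy the CSI CephFS driver (plugin)

csi-cephfs-plugin plays a role similar to nfs-client: it runs on every node and performs Ceph mounts and related tasks.

It is deployed as a DaemonSet with two containers: CSI driver-registrar and the CSI CephFS driver.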

$ kubectl create -f csi-cephfsplugin.yaml

3.1. csi-cephfsplugin.yaml

kind: DaemonSet
apiVersion: apps/v1beta2
metadata:
  name: csi-cephfsplugin
spec:
  selector:
    matchLabels:
      app: csi-cephfsplugin
  template:
    metadata:
      labels:
        app: csi-cephfsplugin
    spec:
      serviceAccount: csi-nodeplugin
      hostNetwork: true
      # to use e.g. Rook orchestrated cluster, and mons' FQDN is
      # resolved through k8s service, set dns policy to cluster first
      dnsPolicy: ClusterFirstWithHostNet      
      containers:
        - name: driver-registrar
          image: quay.io/k8scsi/driver-registrar:v0.3.0
          args:
            - "--v=5"
            - "--csi-address=$(ADDRESS)"
            - "--kubelet-registration-path=$(DRIVER_REG_SOCK_PATH)"
          env:
            - name: ADDRESS
              value: /var/lib/kubelet/plugins/csi-cephfsplugin/csi.sock
            - name: DRIVER_REG_SOCK_PATH
              value: /var/lib/kubelet/plugins/csi-cephfsplugin/csi.sock
            - name: KUBE_NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          volumeMounts:
            - name: socket-dir
              mountPath: /var/lib/kubelet/plugins/csi-cephfsplugin
            - name: registration-dir
              mountPath: /registration
        - name: csi-cephfsplugin
          securityContext:
            privileged: true
            capabilities:
              add: ["SYS_ADMIN"]
            allowPrivilegeEscalation: true
          image: quay.io/cephcsi/cephfsplugin:v0.3.0
          args:
            - "--nodeid=$(NODE_ID)"
            - "--endpoint=$(CSI_ENDPOINT)"
            - "--v=5"
            - "--drivername=csi-cephfsplugin"
          env:
            - name: NODE_ID
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
            - name: CSI_ENDPOINT
              value: unix://var/lib/kubelet/plugins/csi-cephfsplugin/csi.sock
          imagePullPolicy: "IfNotPresent"
          volumeMounts:
            - name: plugin-dir
              mountPath: /var/lib/kubelet/plugins/csi-cephfsplugin
            - name: pods-mount-dir
              mountPath: /var/lib/kubelet/pods
              mountPropagation: "Bidirectional"
            - mountPath: /sys
              name: host-sys
            - name: lib-modules
              mountPath: /lib/modules
              readOnly: true
            - name: host-dev
              mountPath: /dev
      volumes:
        - name: plugin-dir
          hostPath:
            path: /var/lib/kubelet/plugins/csi-cephfsplugin
            type: DirectoryOrCreate
        - name: registration-dir
          hostPath:
            path: /var/lib/kubelet/plugins/
            type: Directory
        - name: pods-mount-dir
          hostPath:
            path: /var/lib/kubelet/pods
            type: Directory
        - name: socket-dir
          hostPath:
            path: /var/lib/kubelet/plugins/csi-cephfsplugin
            type: DirectoryOrCreate
        - name: host-sys
          hostPath:
            path: /sys
        - name: lib-modules
          hostPath:
            path: /lib/modules
        - name: host-dev
          hostPath:
            path: /dev

4. Verify the deployment

$ kubectl get all
NAME                                 READY     STATUS    RESTARTS   AGE
pod/csi-cephfsplugin-attacher-0      1/1       Running   0          26s
pod/csi-cephfsplugin-provisioner-0   1/1       Running   0          25s
pod/csi-cephfsplugin-rljcv           2/2       Running   0          24s

NAME                                   TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)     AGE
service/csi-cephfsplugin-attacher      ClusterIP   10.104.116.218   <none>        12345/TCP   27s
service/csi-cephfsplugin-provisioner   ClusterIP   10.101.78.75     <none>        12345/TCP   26s

...

6.2.3 - Deploy cephfs-provisioner

1. Install the CephFS client

Install the CephFS client on every node; it is needed to mount volumes from the Ceph cluster.

yum install -y ceph-common

2. Deploy RBAC

2.1. ClusterRole

kind: ClusterRole
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cephfs-provisioner
  namespace: cephfs
rules:
  - apiGroups: [""]
    resources: ["persistentvolumes"]
    verbs: ["get", "list", "watch", "create", "delete"]
  - apiGroups: [""]
    resources: ["persistentvolumeclaims"]
    verbs: ["get", "list", "watch", "update"]
  - apiGroups: ["storage.k8s.io"]
    resources: ["storageclasses"]
    verbs: ["get", "list", "watch"]
  - apiGroups: [""]
    resources: ["events"]
    verbs: ["create", "update", "patch"]
  - apiGroups: [""]
    resources: ["services"]
    resourceNames: ["kube-dns","coredns"]
    verbs: ["list", "get"]

2.2. ClusterRoleBinding

kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1
metadata:
  name: cephfs-provisioner
subjects:
  - kind: ServiceAccount
    name: cephfs-provisioner
    namespace: cephfs
roleRef:
  kind: ClusterRole
  name: cephfs-provisioner
  apiGroup: rbac.authorization.k8s.io

2.3. Role

apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: cephfs-provisioner
  namespace: cephfs
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    verbs: ["create", "get", "delete"]
  - apiGroups: [""]
    resources: ["endpoints"]
    verbs: ["get", "list", "watch", "create", "update", "patch"]

2.4. RoleBinding

apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: cephfs-provisioner
  namespace: cephfs
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: cephfs-provisioner
subjects:
- kind: ServiceAccount
  name: cephfs-provisioner

2.5. ServiceAccount

apiVersion: v1
kind: ServiceAccount
metadata:
  name: cephfs-provisioner
  namespace: cephfs

3. Deploy cephfs-provisioner

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: cephfs-provisioner
  namespace: cephfs
spec:
  replicas: 1
  strategy:
    type: Recreate
  template:
    metadata:
      labels:
        app: cephfs-provisioner
    spec:
      containers:
      - name: cephfs-provisioner
        image: "quay.io/external_storage/cephfs-provisioner:latest"
        resources:
          limits:
            cpu: 500m
            memory: 512Mi
          requests:
            cpu: 100m
            memory: 64Mi        
        env:
        - name: PROVISIONER_NAME                # must match the provisioner field of the StorageClass
          value: ceph.com/cephfs
        - name: PROVISIONER_SECRET_NAMESPACE    # must match the namespace used in the RBAC objects
          value: cephfs
        command:
        - "/usr/local/bin/cephfs-provisioner"
        args:
        - "-id=cephfs-provisioner-1"
      serviceAccount: cephfs-provisioner

4. Deploy the StorageClass

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: cephfs-provisioner-sc
provisioner: ceph.com/cephfs
volumeBindingMode: WaitForFirstConsumer
parameters:
  monitors: 192.168.27.43:6789,192.168.27.44:6789,192.168.27.45:6789
  adminId: admin
  adminSecretName: csi-cephfs-secret
  adminSecretNamespace: "kube-csi"
  claimRoot: /pvc-volumes

5. Deploy a StatefulSet

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: cephfs-provisioner-nginx
spec:
  serviceName: "nginx"
  replicas: 1
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - name: nginx
        image: nginx:latest   # nginx image
        imagePullPolicy: IfNotPresent
        volumeMounts:
        - mountPath: "/mnt"      # mount path inside the container, backed by the CephFS volume
          name: test
  volumeClaimTemplates:
  - metadata:
      name: test
    spec:
      accessModes: [ "ReadWriteOnce" ]
      resources:
        requests:
          storage: 2Gi
      storageClassName: cephfs-provisioner-sc

6. Logs

6.1. cephfs-provisioner execution log

I0327 07:18:19.742239       1 controller.go:987] provision "default/test-cephfs-ngx-wait-22-0" class "cephfs-provisioner-sc": started
I0327 07:18:19.745239       1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-cephfs-ngx-wait-22-0", UID:"7f6b60d5-5060-11e9-9a9c-c81f66bcff65", APIVersion:"v1", ResourceVersion:"347214256", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/test-cephfs-ngx-wait-22-0"
I0327 07:18:23.281277       1 cephfs-provisioner.go:222] successfully created CephFS share &CephFSPersistentVolumeSource{Monitors:[192.168.27.43:6789 192.168.27.44:6789 192.168.27.45:6789],Path:/pvc-volumes/kubernetes/kubernetes-dynamic-pvc-7f7cb62f-5060-11e9-85c0-0adb8ef08100,User:kubernetes-dynamic-user-7f7cb69f-5060-11e9-85c0-0adb8ef08100,SecretFile:,SecretRef:&SecretReference{Name:ceph-kubernetes-dynamic-user-7f7cb69f-5060-11e9-85c0-0adb8ef08100-secret,Namespace:default,},ReadOnly:false,}
I0327 07:18:23.281371       1 controller.go:1087] provision "default/test-cephfs-ngx-wait-22-0" class "cephfs-provisioner-sc": volume "pvc-7f6b60d5-5060-11e9-9a9c-c81f66bcff65" provisioned
I0327 07:18:23.281415       1 controller.go:1101] provision "default/test-cephfs-ngx-wait-22-0" class "cephfs-provisioner-sc": trying to save persistentvvolume "pvc-7f6b60d5-5060-11e9-9a9c-c81f66bcff65"
I0327 07:18:23.284621       1 controller.go:1108] provision "default/test-cephfs-ngx-wait-22-0" class "cephfs-provisioner-sc": persistentvolume "pvc-7f6b60d5-5060-11e9-9a9c-c81f66bcff65" saved
I0327 07:18:23.284723       1 controller.go:1149] provision "default/test-cephfs-ngx-wait-22-0" class "cephfs-provisioner-sc": succeeded
I0327 07:18:23.284810       1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-cephfs-ngx-wait-22-0", UID:"7f6b60d5-5060-11e9-9a9c-c81f66bcff65", APIVersion:"v1", ResourceVersion:"347214256", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-7f6b60d5-5060-11e9-9a9c-c81f66bcff65

6.2. Debug log

I0327 08:08:11.789608       1 controller.go:987] provision "default/test-cephfs-ngx-wait-44-0" class "cephfs-sc-wait": started
I0327 08:08:11.793258       1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-cephfs-ngx-wait-44-0", UID:"81846859-5067-11e9-9a9c-c81f66bcff65", APIVersion:"v1", ResourceVersion:"347237916", FieldPath:""}): type: 'Normal' reason: 'Provisioning' External provisioner is provisioning volume for claim "default/test-cephfs-ngx-wait-44-0"
E0327 08:08:12.164705       1 cephfs-provisioner.go:158] failed to provision share "kubernetes-dynamic-pvc-76ecdc5a-5067-11e9-9421-2a1b1be1aeef" for "kubernetes-dynamic-user-76ecdcee-5067-11e9-9421-2a1b1be1aeef", err: exit status 1, output: Traceback (most recent call last):
  File "/usr/local/bin/cephfs_provisioner", line 364, in <module>
    main()
  File "/usr/local/bin/cephfs_provisioner", line 358, in main
    print cephfs.create_share(share, user, size=size)
  File "/usr/local/bin/cephfs_provisioner", line 228, in create_share
    volume = self.volume_client.create_volume(volume_path, size=size, namespace_isolated=not self.ceph_namespace_isolation_disabled)
  File "/usr/local/bin/cephfs_provisioner", line 112, in volume_client
    self._volume_client.connect(None)
  File "/lib/python2.7/site-packages/ceph_volume_client.py", line 458, in connect
    self.rados.connect()
  File "rados.pyx", line 895, in rados.Rados.connect (/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.1/rpm/el7/BUILD/ceph-13.2.1/build/src/pybind/rados/pyrex/rados.c:9815)
rados.IOError: [errno 5] error connecting to the cluster
W0327 08:08:12.164908       1 controller.go:746] Retrying syncing claim "default/test-cephfs-ngx-wait-44-0" because failures 2 < threshold 15
E0327 08:08:12.164977       1 controller.go:761] error syncing claim "default/test-cephfs-ngx-wait-44-0": failed to provision volume with StorageClass "cephfs-sc-wait": exit status 1
I0327 08:08:12.165974       1 event.go:221] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"default", Name:"test-cephfs-ngx-wait-44-0", UID:"81846859-5067-11e9-9a9c-c81f66bcff65", APIVersion:"v1", ResourceVersion:"347237916", FieldPath:""}): type: 'Warning' reason: 'ProvisioningFailed' failed to provision volume with StorageClass "cephfs-sc-wait": exit status 1

6.2.4 -

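1. Introduction to FlexVolume

FlexVolume provides a way to extend Kubernetes with custom storage plugins, so users can implement their own storage drivers. CSI offers similar functionality. FlexVolume has been GA since Kubernetes 1.8.

2. Usage

Install the storage plugin binary on every node. The binary implements the FlexVolume interface described below; the default plugin path is /usr/libexec/kubernetes/kubelet-plugins/volume/exec/<vendor~driver>/<driver>

The vendor~driver directory name must match the flexVolume.driver field in the pod spec, with ~ replaced by /.

For example:

  • path: /usr/libexec/kubernetes/kubelet-plugins/volume/exec/foo~cifs/cifs

  • flexVolume.driver in the pod: foo/cifs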
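
The naming rule can be sketched as a small shell helper. This is only an illustration: flexvolume_driver_path is a hypothetical function name, and PLUGIN_ROOT assumes the default plugin directory quoted above.

```shell
#!/bin/bash
# Sketch: map a pod's flexVolume.driver value to the plugin path on a node.
# PLUGIN_ROOT is the default directory mentioned in the text (an assumption
# if the kubelet was started with a different plugin directory).
PLUGIN_ROOT="/usr/libexec/kubernetes/kubelet-plugins/volume/exec"

# flexvolume_driver_path <vendor/driver>
# Prints <PLUGIN_ROOT>/<vendor~driver>/<driver>.
flexvolume_driver_path() {
    vendor=${1%%/*}   # text before the first "/"
    driver=${1#*/}    # text after the first "/"
    printf '%s/%s~%s/%s\n' "$PLUGIN_ROOT" "$vendor" "$driver" "$driver"
}

flexvolume_driver_path "foo/cifs"
flexvolume_driver_path "k8s/nfs"
```

The second call yields the path of the nfs script used in the example section later in this chapter.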

3. FlexVolume interface

The storage plugin on each node must implement the following call-outs.

3.1. init

<driver executable> init

3.2. attach

<driver executable> attach <json options> <node name>

3.3. detach

<driver executable> detach <mount device> <node name>

3.4. waitforattach

<driver executable> waitforattach <mount device> <json options>

3.5. isattached

<driver executable> isattached <json options> <node name>

3.6. mountdevice

<driver executable> mountdevice <mount dir> <mount device> <json options>

3.7. unmountdevice

<driver executable> unmountdevice <mount device>

3.8. mount

<driver executable> mount <mount dir> <json options>

3.9. unmount

<driver executable> unmount <mount dir>

3.10. Driver output

{
    "status": "<Success/Failure/Not supported>",
    "message": "<Reason for success/failure>",
    "device": "<Path to the device attached. This field is valid only for attach & waitforattach call-outs>",
    "volumeName": "<Cluster wide unique name of the volume. Valid only for getvolumename call-out>",
    "attached": <True/False (Return true if volume is attached on the node. Valid only for isattached call-out)>,
    "capabilities": <Only included as part of the Init response>
    {
        "attach": <True/False (Return true if the driver implements attach and detach)>
    }
}

4. Example

4.1. Pod YAML

nginx-nfs.yaml

The relevant parameters are flexVolume.driver, fsType, and options.

apiVersion: v1
kind: Pod
metadata:
  name: nginx-nfs
  namespace: default
spec:
  containers:
  - name: nginx-nfs
    image: nginx
    volumeMounts:
    - name: test
      mountPath: /data
    ports:
    - containerPort: 80
  volumes:
  - name: test
    flexVolume:
      driver: "k8s/nfs"
      fsType: "nfs"
      options:
        server: "172.16.0.25"
        share: "dws_nas_scratch"

4.2. Plugin script

The nfs script implements the FlexVolume interface. It is installed at:

/usr/libexec/kubernetes/kubelet-plugins/volume/exec/k8s~nfs/nfs

#!/bin/bash

# Copyright 2015 The Kubernetes Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Notes:
#  - Please install "jq" package before using this driver.
usage() {
	err "Invalid usage. Usage: "
	err "\t$0 init"
	err "\t$0 mount <mount dir> <json params>"
	err "\t$0 unmount <mount dir>"
	exit 1
}

err() {
	echo -ne $* 1>&2
}

log() {
	echo -ne $* >&1
}

ismounted() {
	MOUNT=`findmnt -n ${MNTPATH} 2>/dev/null | cut -d' ' -f1`
	if [ "${MOUNT}" == "${MNTPATH}" ]; then
		echo "1"
	else
		echo "0"
	fi
}

domount() {
	MNTPATH=$1

	NFS_SERVER=$(echo $2 | jq -r '.server')
	SHARE=$(echo $2 | jq -r '.share')

	if [ $(ismounted) -eq 1 ] ; then
		log '{"status": "Success"}'
		exit 0
	fi

	mkdir -p ${MNTPATH} &> /dev/null

	mount -t nfs ${NFS_SERVER}:/${SHARE} ${MNTPATH} &> /dev/null
	if [ $? -ne 0 ]; then
		err "{ \"status\": \"Failure\", \"message\": \"Failed to mount ${NFS_SERVER}:${SHARE} at ${MNTPATH}\"}"
		exit 1
	fi
	log '{"status": "Success"}'
	exit 0
}

unmount() {
	MNTPATH=$1
	if [ $(ismounted) -eq 0 ] ; then
		log '{"status": "Success"}'
		exit 0
	fi

	umount ${MNTPATH} &> /dev/null
	if [ $? -ne 0 ]; then
		err "{ \"status\": \"Failure\", \"message\": \"Failed to unmount volume at ${MNTPATH}\"}"
		exit 1
	fi

	log '{"status": "Success"}'
	exit 0
}

op=$1

if ! command -v jq >/dev/null 2>&1; then
	err "{ \"status\": \"Failure\", \"message\": \"'jq' binary not found. Please install jq package before using this driver\"}"
	exit 1
fi

if [ "$op" = "init" ]; then
	log '{"status": "Success", "capabilities": {"attach": false}}'
	exit 0
fi

if [ $# -lt 2 ]; then
	usage
fi

shift

case "$op" in
	mount)
		domount $*
		;;
	unmount)
		unmount $*
		;;
	*)
		log '{"status": "Not supported"}'
		exit 0
esac

exit 1

7 - Resource Isolation

7.1 -

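Resource Quotas (ResourceQuota)

A ResourceQuota object defines usage limits for all resources in a namespace, including:

  • quotas for compute resources
  • quotas for storage resources
  • quotas for object counts

If the total capacity of the cluster is less than the sum of the namespace quotas, there may be contention for resources, which is handled on a first-come, first-served basis. Neither contention nor quota changes affect resources that have already been created.

1. Enabling resource quotas

Many Kubernetes distributions enable resource quota support by default. It is enabled when ResourceQuota is included in the apiserver's --admission-control flag (replaced by --enable-admission-plugins in newer releases). Quotas are enforced in a namespace once that namespace contains a ResourceQuota object.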

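2. Compute resource quotas

You can limit the total amount of compute resources that can be requested in a given namespace.

Resource name      Description
cpu                Across all pods in a non-terminal state, the sum of CPU requests cannot exceed this value.
limits.cpu         Across all pods in a non-terminal state, the sum of CPU limits cannot exceed this value.
limits.memory      Across all pods in a non-terminal state, the sum of memory limits cannot exceed this value.
memory             Across all pods in a non-terminal state, the sum of memory requests cannot exceed this value.
requests.cpu       Across all pods in a non-terminal state, the sum of CPU requests cannot exceed this value.
requests.memory    Across all pods in a non-terminal state, the sum of memory requests cannot exceed this value.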

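3. Storage resource quotas

You can limit the total amount of storage resources that can be requested in a given namespace.

Resource name                                                              Description
requests.storage                                                           Across all PVCs, the sum of storage requests cannot exceed this value.
persistentvolumeclaims                                                     The total number of PVCs (persistent volume claims) that can exist in the namespace.
<storage-class-name>.storageclass.storage.k8s.io/requests.storage          Across all PVCs associated with the storage class, the sum of storage requests cannot exceed this value.
<storage-class-name>.storageclass.storage.k8s.io/persistentvolumeclaims    Across all PVCs associated with the storage class, the total number of PVCs that can exist in the namespace.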

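4. Object count quotas

Resource name             Description
configmaps                The total number of ConfigMaps that can exist in the namespace.
persistentvolumeclaims    The total number of PVCs that can exist in the namespace.
pods                      The total number of pods in a non-terminal state that can exist in the namespace. A pod is in a terminal state if its status.phase is Failed or Succeeded.
replicationcontrollers    The total number of ReplicationControllers that can exist in the namespace.
resourcequotas            The total number of resource quotas that can exist in the namespace.
services                  The total number of Services that can exist in the namespace.
services.loadbalancers    The total number of Services of type LoadBalancer that can exist in the namespace.
services.nodeports        The total number of Services of type NodePort that can exist in the namespace.
secrets                   The total number of Secrets that can exist in the namespace.

For example, a pod quota can prevent a single user from consuming too many Pod IPs.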

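5. Quota scopes

Scope             Description
Terminating       Matches pods where spec.activeDeadlineSeconds >= 0
NotTerminating    Matches pods where spec.activeDeadlineSeconds is nil
BestEffort        Matches pods with best-effort quality of service
NotBestEffort     Matches pods that do not have best-effort quality of service

6. Requests and limits

When allocating compute resources, each container may specify a request and a limit for CPU or memory. A quota can be set on either value. If a quota is set for requests.cpu or requests.memory, every incoming container must explicitly specify a request for that resource. If a quota is set for limits.cpu or limits.memory, every incoming container must explicitly specify a limit for that resource.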

7. Viewing and setting quotas

# Create the namespace
$ kubectl create namespace myspace

# Create the ResourceQuota
$ cat <<EOF > compute-resources.yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: compute-resources
spec:
  hard:
    pods: "4"
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
EOF
$ kubectl create -f ./compute-resources.yaml --namespace=myspace

# List the ResourceQuotas
$ kubectl get quota --namespace=myspace
NAME                    AGE
compute-resources       30s

# Show the ResourceQuota details
$ kubectl describe quota compute-resources --namespace=myspace
Name:                  compute-resources
Namespace:             myspace
Resource               Used Hard
--------               ---- ----
limits.cpu             0    2
limits.memory          0    2Gi
pods                   0    4
requests.cpu           0    1
requests.memory        0    1Gi
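
The Used and Hard columns can be read as a simple admission rule: a new pod is accepted only if current usage plus the pod's own request stays within the hard value. The sketch below illustrates that rule for requests.cpu. to_millicores and quota_allows are hypothetical helper names, not kubectl features, and only two quantity forms are parsed.

```shell
#!/bin/bash
# to_millicores <cpu quantity>: convert "500m" or a whole number of cores
# such as "2" to millicores.
# Assumption: fractional core values such as "0.5" are not handled.
to_millicores() {
    case "$1" in
        *m) echo "${1%m}" ;;
        *)  echo $(( $1 * 1000 )) ;;
    esac
}

# quota_allows <used> <hard> <requested>
# Succeeds when used + requested <= hard.
quota_allows() {
    used=$(to_millicores "$1")
    hard=$(to_millicores "$2")
    want=$(to_millicores "$3")
    [ $(( used + want )) -le "$hard" ]
}

# With a requests.cpu hard value of "1" and 600m already in use:
if quota_allows 600m 1 300m; then echo "300m: admitted"; else echo "300m: rejected"; fi
if quota_allows 600m 1 500m; then echo "500m: admitted"; else echo "500m: rejected"; fi
```

The first request fits (900m of 1000m); the second would bring usage to 1100m and is rejected.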

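8. Quotas and cluster capacity

ResourceQuota objects are independent of cluster capacity; they are expressed in absolute units. Adding nodes to the cluster therefore does not increase the resources available to a namespace whose quota is already configured.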

7.2 -

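Pod Limits (LimitRange)

A ResourceQuota object limits the total resources of all Pods (containers) in a namespace.

A LimitRange object limits the resources of each individual Pod (container) in a namespace.

A LimitRange object defines usage limits for specific kinds of objects in a namespace: Pod, Container, and PersistentVolumeClaim.

1. Configuring default CPU and memory values for a namespace

If a container is created in a namespace that has a default memory or CPU limit, and the container does not specify its own memory or CPU limit, it is assigned the default limit. The default memory or CPU request is assigned only when the container sets neither a limit nor a request.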

1.1. Default memory values for a namespace

# Create the namespace
$ kubectl create namespace default-mem-example

# Create the LimitRange
$ cat memory-defaults.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-limit-range
spec:
  limits:
  - default:
      memory: 512Mi
    defaultRequest:
      memory: 256Mi
    type: Container
  
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-defaults.yaml --namespace=default-mem-example

# Create a Pod that specifies neither a memory limit nor a memory request
$ cat memory-defaults-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: default-mem-demo
spec:
  containers:
  - name: default-mem-demo-ctr
    image: nginx
    
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-defaults-pod.yaml --namespace=default-mem-example

# Inspect the Pod
$ kubectl get pod default-mem-demo --output=yaml --namespace=default-mem-example
containers:
- image: nginx
  imagePullPolicy: Always
  name: default-mem-demo-ctr
  resources:
    limits:
      memory: 512Mi
    requests:
      memory: 256Mi

1.2. Default CPU values for a namespace

# Create the namespace
$ kubectl create namespace default-cpu-example

# Create the LimitRange
$ cat cpu-defaults.yaml 
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-limit-range
spec:
  limits:
  - default:
      cpu: 1
    defaultRequest:
      cpu: 0.5
    type: Container
    
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-defaults.yaml --namespace=default-cpu-example    

# Create a Pod that specifies neither a CPU limit nor a CPU request
$ cat cpu-defaults-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: default-cpu-demo
spec:
  containers:
  - name: default-cpu-demo-ctr
    image: nginx

$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-defaults-pod.yaml --namespace=default-cpu-example

# Inspect the Pod
$ kubectl get pod default-cpu-demo --output=yaml --namespace=default-cpu-example
containers:
- image: nginx
  imagePullPolicy: Always
  name: default-cpu-demo-ctr
  resources:
    limits:
      cpu: "1"
    requests:
      cpu: 500m

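1.3. Notes

  1. If a pod specifies neither request nor limit, it gets the default request and limit defined by the LimitRange object.
  2. If a pod specifies a limit but no request, its request is set to the limit, not to the default request defined by the LimitRange object.
  3. If a pod specifies a request but no limit, its limit is set to the default limit defined by the LimitRange object.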
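
These three rules can be written down as a small decision function. The sketch below is only a model of the defaulting behaviour described above; apply_defaults is a hypothetical name, and an empty string stands for "not set".

```shell
#!/bin/bash
# apply_defaults <request> <limit> <defaultRequest> <default>
# Pass "" for a request or limit that the container does not set.
# Prints the effective values as "request=<r> limit=<l>".
apply_defaults() {
    request=$1
    limit=$2
    default_request=$3
    default_limit=$4
    if [ -z "$request" ] && [ -z "$limit" ]; then
        # Rule 1: nothing set, use both LimitRange defaults.
        request=$default_request
        limit=$default_limit
    elif [ -z "$request" ]; then
        # Rule 2: only limit set, request follows the limit.
        request=$limit
    elif [ -z "$limit" ]; then
        # Rule 3: only request set, limit falls back to the default limit.
        limit=$default_limit
    fi
    echo "request=$request limit=$limit"
}

# Using the defaults from memory-defaults.yaml (256Mi / 512Mi):
apply_defaults ""    ""  256Mi 512Mi
apply_defaults ""    1Gi 256Mi 512Mi
apply_defaults 128Mi ""  256Mi 512Mi
```

When both values are set, the function leaves them unchanged, which matches the fact that explicit settings are not overridden by defaults.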

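Motivation for default limits and requests

If a namespace has a ResourceQuota, it is useful to set default values for the memory limit (CPU limit). A resource quota imposes two restrictions on a namespace:

  • Every container running in the namespace must have its own memory limit (CPU limit).
  • The total memory (CPU) used by all containers in the namespace must not exceed the specified quota.

If a container does not specify its own memory limit (CPU limit), it is given the default limit and can then run in the quota-restricted namespace.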

2. Configuring minimum and maximum CPU and memory values for a namespace

2.1. Minimum and maximum memory

Create the LimitRange

# Create the namespace
$ kubectl create namespace constraints-mem-example

# Create the LimitRange
$ cat memory-constraints.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: mem-min-max-demo-lr
spec:
  limits:
  - max:
      memory: 1Gi
    min:
      memory: 500Mi
    type: Container
 
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-constraints.yaml --namespace=constraints-mem-example

# Inspect the LimitRange
$ kubectl get limitrange mem-min-max-demo-lr --namespace=constraints-mem-example --output=yaml
...
  limits:
  - default:
      memory: 1Gi
    defaultRequest:
      memory: 1Gi
    max:
      memory: 1Gi
    min:
      memory: 500Mi
    type: Container
...
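# The LimitRange sets only min and max; the default values are filled in automatically.

Create a Pod that meets the constraints

# Create a Pod that meets the constraints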
$ cat memory-constraints-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: constraints-mem-demo
spec:
  containers:
  - name: constraints-mem-demo-ctr
    image: nginx
    resources:
      limits:
        memory: "800Mi"
      requests:
        memory: "600Mi"
 
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-constraints-pod.yaml --namespace=constraints-mem-example

# Inspect the Pod
$ kubectl get pod constraints-mem-demo --output=yaml --namespace=constraints-mem-example
...
resources:
  limits:
    memory: 800Mi
  requests:
    memory: 600Mi
...

Create a Pod that exceeds the maximum memory limit

$ cat memory-constraints-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: constraints-mem-demo-2
spec:
  containers:
  - name: constraints-mem-demo-2-ctr
    image: nginx
    resources:
      limits:
        memory: "1.5Gi"  # exceeds the maximum of 1Gi
      requests:
        memory: "800Mi"
        
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-constraints-pod-2.yaml --namespace=constraints-mem-example

# Pod creation fails because the container's memory limit is too large
Error from server (Forbidden): error when creating "docs/tasks/administer-cluster/memory-constraints-pod-2.yaml":
pods "constraints-mem-demo-2" is forbidden: maximum memory usage per Container is 1Gi, but limit is 1536Mi.

Create a Pod below the minimum memory request

$ cat memory-constraints-pod-3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: constraints-mem-demo-3
spec:
  containers:
  - name: constraints-mem-demo-3-ctr
    image: nginx
    resources:
      limits:
        memory: "800Mi"
      requests:
        memory: "100Mi"   # below the minimum of 500Mi
        
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-constraints-pod-3.yaml --namespace=constraints-mem-example         

# Pod creation fails because the container's memory request is too small
Error from server (Forbidden): error when creating "docs/tasks/administer-cluster/memory-constraints-pod-3.yaml":
pods "constraints-mem-demo-3" is forbidden: minimum memory usage per Container is 500Mi, but request is 100Mi.

Create a Pod that specifies no memory limit or request

$ cat memory-constraints-pod-4.yaml
apiVersion: v1
kind: Pod
metadata:
  name: constraints-mem-demo-4
spec:
  containers:
  - name: constraints-mem-demo-4-ctr
    image: nginx

$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/memory-constraints-pod-4.yaml --namespace=constraints-mem-example

# Inspect the Pod
$ kubectl get pod constraints-mem-demo-4 --namespace=constraints-mem-example --output=yaml
...
resources:
  limits:
    memory: 1Gi
  requests:
    memory: 1Gi
...

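The container does not specify its own memory request and limit, so it gets the default memory request and limit from the LimitRange.

2.2. Minimum and maximum CPU

Create the LimitRange

# Create the namespace
$ kubectl create namespace constraints-cpu-example

# Create the LimitRange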
$ cat cpu-constraints.yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: cpu-min-max-demo-lr
spec:
  limits:
  - max:
      cpu: "800m"
    min:
      cpu: "200m"
    type: Container
    
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-constraints.yaml --namespace=constraints-cpu-example

# Inspect the LimitRange
$ kubectl get limitrange cpu-min-max-demo-lr --output=yaml --namespace=constraints-cpu-example
...
limits:
- default:
    cpu: 800m
  defaultRequest:
    cpu: 800m
  max:
    cpu: 800m
  min:
    cpu: 200m
  type: Container
...

Create a Pod that meets the constraints

$ cat cpu-constraints-pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: constraints-cpu-demo
spec:
  containers:
  - name: constraints-cpu-demo-ctr
    image: nginx
    resources:
      limits:
        cpu: "800m"
      requests:
        cpu: "500m"
        
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-constraints-pod.yaml --namespace=constraints-cpu-example

# Inspect the Pod
$ kubectl get pod constraints-cpu-demo --output=yaml --namespace=constraints-cpu-example
...
resources:
  limits:
    cpu: 800m
  requests:
    cpu: 500m
...

Create a Pod that exceeds the maximum CPU limit

$ cat cpu-constraints-pod-2.yaml
apiVersion: v1
kind: Pod
metadata:
  name: constraints-cpu-demo-2
spec:
  containers:
  - name: constraints-cpu-demo-2-ctr
    image: nginx
    resources:
      limits:
        cpu: "1.5"
      requests:
        cpu: "500m"
        
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-constraints-pod-2.yaml --namespace=constraints-cpu-example

# Pod creation fails because the container's CPU limit is too large
Error from server (Forbidden): error when creating "docs/tasks/administer-cluster/cpu-constraints-pod-2.yaml":
pods "constraints-cpu-demo-2" is forbidden: maximum cpu usage per Container is 800m, but limit is 1500m.

Create a Pod below the minimum CPU request

$ cat cpu-constraints-pod-3.yaml
apiVersion: v1
kind: Pod
metadata:
  name: constraints-cpu-demo-4
spec:
  containers:
  - name: constraints-cpu-demo-4-ctr
    image: nginx
    resources:
      limits:
        cpu: "800m"
      requests:
        cpu: "100m"
        
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-constraints-pod-3.yaml --namespace=constraints-cpu-example

# Pod creation fails because the container's CPU request is too small
Error from server (Forbidden): error when creating "docs/tasks/administer-cluster/cpu-constraints-pod-3.yaml":
pods "constraints-cpu-demo-4" is forbidden: minimum cpu usage per Container is 200m, but request is 100m.

Create a Pod that specifies no CPU limit or request

$ cat cpu-constraints-pod-4.yaml
apiVersion: v1
kind: Pod
metadata:
  name: constraints-cpu-demo-4
spec:
  containers:
  - name: constraints-cpu-demo-4-ctr
    image: vish/stress
    
$ kubectl create -f https://k8s.io/docs/tasks/administer-cluster/cpu-constraints-pod-4.yaml --namespace=constraints-cpu-example    

# Inspect the Pod
$ kubectl get pod constraints-cpu-demo-4 --namespace=constraints-cpu-example --output=yaml
...
resources:
  limits:
    cpu: 800m
  requests:
    cpu: 800m
...

The container does not specify its own CPU request and limit, so it gets the default CPU request and limit from the LimitRange.

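2.3. Notes

The minimum and maximum memory (CPU) constraints that a LimitRange imposes on a namespace are applied only when a Pod is created or updated. Changing a LimitRange does not affect Pods created earlier.

Each time a Pod is created or updated, Kubernetes performs the following steps:

  • If a container does not specify its own memory (CPU) request and limit, default values are assigned.
  • Verify that the container's memory (CPU) request is greater than or equal to the minimum.
  • Verify that the container's memory (CPU) limit is less than or equal to the maximum.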
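
The two verification steps can be sketched in shell as follows. This is a simplified model, not the admission controller's code: to_mib and check_memory are hypothetical helpers, only whole-number Mi and Gi quantities are parsed, and the printed messages are illustrative rather than the API server's wording.

```shell
#!/bin/bash
# to_mib <quantity>: convert "<n>Mi" or "<n>Gi" (whole numbers only) to MiB.
to_mib() {
    case "$1" in
        *Gi) echo $(( ${1%Gi} * 1024 )) ;;
        *Mi) echo "${1%Mi}" ;;
        *)   echo "unsupported quantity: $1" >&2; return 1 ;;
    esac
}

# check_memory <request> <limit> <min> <max>
# Returns 0 and prints "allowed" when min <= request and limit <= max.
check_memory() {
    req=$(to_mib "$1") || return 2
    lim=$(to_mib "$2") || return 2
    min=$(to_mib "$3") || return 2
    max=$(to_mib "$4") || return 2
    if [ "$req" -lt "$min" ]; then
        echo "forbidden: request $1 is below the minimum $3"
        return 1
    fi
    if [ "$lim" -gt "$max" ]; then
        echo "forbidden: limit $2 is above the maximum $4"
        return 1
    fi
    echo "allowed"
}

# Using min=500Mi and max=1Gi as in memory-constraints.yaml:
check_memory 600Mi 800Mi 500Mi 1Gi
check_memory 100Mi 800Mi 500Mi 1Gi || true
check_memory 800Mi 2Gi   500Mi 1Gi || true
```

The third call uses 2Gi instead of the 1.5Gi from the earlier example because the helper does not parse fractional quantities.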

7.3 -

Resource Quality of Service

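1. Introduction to resource QoS

The request value is the amount of a resource that the container is guaranteed to be allocated. The limit is the maximum amount the container is allowed to use. The Pod-level request and limit are the sums of the requests and limits of all its containers.

2. Requests and Limits

A Pod can specify request and limit values for a resource, where 0 <= request <= Node Allocatable and request <= limit <= Infinity. Scheduling is based on request, not limit: if a Pod is scheduled successfully, it is guaranteed the requested amount of the resource. How a Pod is treated when it tries to use more than its limit depends on whether the resource is compressible.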
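
The inequality above can be checked mechanically. The following is a minimal sketch for CPU values expressed in millicores; valid_cpu is a hypothetical helper, and an empty limit is used here to represent "no limit" (Infinity).

```shell
#!/bin/bash
# valid_cpu <request_m> <limit_m> <allocatable_m>
# All values are whole millicores. Pass "" as the limit for "no limit".
# Succeeds when 0 <= request <= allocatable and request <= limit.
valid_cpu() {
    [ "$1" -ge 0 ] && [ "$1" -le "$3" ] || return 1
    [ -z "$2" ] || [ "$1" -le "$2" ]
}

# A node with 4000m allocatable CPU:
if valid_cpu 500 1000 4000; then echo "500m/1000m: valid"; else echo "500m/1000m: invalid"; fi
if valid_cpu 1500 1000 4000; then echo "1500m/1000m: valid"; else echo "1500m/1000m: invalid"; fi
if valid_cpu 5000 "" 4000; then echo "5000m/none: valid"; else echo "5000m/none: invalid"; fi
```

The second case fails because the request exceeds the limit; the third fails because the request exceeds what the node can allocate, so the Pod could never be scheduled there.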

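2.1. Compressible resources

  • Currently only CPU is supported.
  • Pods are guaranteed the amount of CPU they request; they may or may not get additional CPU time, depending on the other jobs running. CPU isolation is currently enforced at the container level rather than the pod level.

2.2. Incompressible resources

  • Currently only memory is supported.
  • Pods get the amount of memory they request. If they exceed their memory request, they may be killed (if some other pod needs memory). If they use less memory than requested, they are not killed, except when system tasks or daemons need more memory.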

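3. QoS classes

When a machine's resources are oversold (the sum of limits exceeds the machine's capacity) and CPU or memory is exhausted, some less important containers have to be killed. Containers are therefore divided into three QoS classes, Guaranteed, Burstable, and Best-Effort, in decreasing order of priority.

When CPU demand cannot be met, Pods are not killed but may be temporarily throttled.

Memory is incompressible: when memory is exhausted, containers are killed in order, lowest priority first. Guaranteed has the highest priority; such containers are not killed unless they exceed their limit, or resources are exhausted and no lower-priority container is left to evict.

3.1. Guaranteed

Every container has both limit and request configured for CPU and memory, and the two values are equal (if only limit is configured, request takes the value of limit).

For example: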

# Example 1: only limits are set, so each request defaults to the matching limit
containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m
      memory: 100Mi
# Example 2: requests and limits are both set and equal
containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m
      memory: 100Mi
    requests:
      cpu: 100m
      memory: 100Mi

3.2. Burstable

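The Pod does not meet the Guaranteed criteria, and at least one container has a request or limit configured. This covers the case where limit and request are set but not equal, and the case where they are set only for some containers or some resources.

For example:

# Example 1: foo is fully specified, bar has no resources at all
containers:
- name: foo
  resources:
    limits:
      cpu: 10m
      memory: 1Gi
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar

# Example 2: each container limits a different resource
containers:
- name: foo
  resources:
    limits:
      memory: 1Gi
- name: bar
  resources:
    limits:
      cpu: 100m

# Example 3: only requests are set, and only on foo
containers:
- name: foo
  resources:
    requests:
      cpu: 10m
      memory: 1Gi
- name: bar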

3.3. Best-Effort

No container has any limit or request configured.

For example:

containers:
- name: foo
  resources: {}
- name: bar
  resources: {}
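
The three definitions can be summarized in a small Python sketch. It is a simplified illustration, not the kubelet's implementation: the function name qos_class is hypothetical, each container is passed as a plain dict with optional "requests" and "limits" maps, and quantities are compared as literal strings (so "1Gi" and "1024Mi" would be treated as different).

```python
from typing import Dict, List

# One container's resources: {"requests": {...}, "limits": {...}}, both optional.
Resources = Dict[str, Dict[str, str]]


def qos_class(containers: List[Resources]) -> str:
    """Classify a Pod as Guaranteed, Burstable, or BestEffort."""
    any_set = False
    guaranteed = True
    for ctr in containers:
        limits = ctr.get("limits") or {}
        # A request that is not set defaults to the limit of the same resource.
        requests = {**limits, **(ctr.get("requests") or {})}
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            # Guaranteed needs a limit and an equal request for CPU and memory.
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

Feeding it the examples from sections 3.1 to 3.3 returns the class named in the corresponding heading.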


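7.4 - Resource view isolation with lxcfs

1. Resource view isolation

Commands such as top and free, when run in a container, read CPU, memory, and similar information from files under /proc. Containers do not isolate file systems such as /proc and /sys, so the CPU and memory figures read inside a container are those of the host, not the resources actually allocated to and limited for the container. The files involved include: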

/proc/cpuinfo
/proc/diskstats
/proc/meminfo
/proc/stat
/proc/swaps
/proc/uptime

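To make the resource view inside a container look more like that of a virtual machine, so that applications see the real CPU and memory figures, the container's actual resource information from cgroups has to be mounted over the corresponding files under /proc in the container. Commands such as top and free then report the container's real CPU and memory.

2. Introduction to lxcfs

lxcfs is a FUSE file system that makes the file system of a Linux container look more like that of a virtual machine. It runs as a long-lived daemon on the host and maintains the mapping between each container's actual resource information in the host cgroups and the files under /proc inside the container.

The lxcfs command-line help is as follows: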

#/usr/local/bin/lxcfs -h
Usage:

lxcfs [-f|-d] -u -l -n [-p pidfile] mountpoint
  -f running foreground by default; -d enable debug output
  -l use loadavg
  -u no swap
  Default pidfile is /run/lxcfs.pid
lxcfs -h

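lxcfs source code: https://github.com/lxc/lxcfs

3. How lxcfs works

lxcfs works by file mounting: it reads the container's information from cgroups, exposes it under the lxcfs directory, and that directory is mapped onto /proc inside the container. The /proc data that top, free, and similar commands read in the container is then the CPU and memory that cgroups actually allocate to the container.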

Figure: lxcfs architecture diagram

Directory mapping

Category  Path inside the container        lxcfs path on the host
CPU       /proc/cpuinfo                    /var/lib/lxcfs/{container_id}/proc/cpuinfo
Memory    /proc/meminfo                    /var/lib/lxcfs/{container_id}/proc/meminfo
          /proc/diskstats                  /var/lib/lxcfs/{container_id}/proc/diskstats
          /proc/stat                       /var/lib/lxcfs/{container_id}/proc/stat
          /proc/swaps                      /var/lib/lxcfs/{container_id}/proc/swaps
          /proc/uptime                     /var/lib/lxcfs/{container_id}/proc/uptime
          /proc/loadavg                    /var/lib/lxcfs/{container_id}/proc/loadavg
          /sys/devices/system/cpu/online   /var/lib/lxcfs/{container_id}/sys/devices/system/cpu/online

4. Usage

4.1. Installing lxcfs

Prerequisites

yum install -y fuse fuse-libs fuse-devel

Build and install from source

git clone https://github.com/lxc/lxcfs.git
cd lxcfs
./bootstrap.sh
./configure
make
make install

Or install from an RPM package

wget https://copr-be.cloud.fedoraproject.org/results/ganto/lxc3/epel-7-x86_64/01041891-lxcfs/lxcfs-3.1.2-0.2.el7.x86_64.rpm;
rpm -ivh lxcfs-3.1.2-0.2.el7.x86_64.rpm --force --nodeps

Verify the installation

lxcfs -h

4.2. Running lxcfs

Running lxcfs takes two commands.

sudo mkdir -p /var/lib/lxcfs
sudo lxcfs /var/lib/lxcfs

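lxcfs can also be run under systemd.

The lxcfs.service file (the heredoc delimiter is quoted so that the shell does not expand $MAINPID):

cat > /usr/lib/systemd/system/lxcfs.service <<'EOF'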
[Unit]
Description=lxcfs

[Service]
ExecStart=/usr/bin/lxcfs -f /var/lib/lxcfs
Restart=on-failure
#ExecReload=/bin/kill -s SIGHUP $MAINPID

[Install]
WantedBy=multi-user.target
EOF

Enable and start the service

systemctl daemon-reload && systemctl enable lxcfs && systemctl start lxcfs && systemctl status lxcfs 

4.3. Mounting the lxcfs files over /proc in a container

docker run -it --rm -m 256m  --cpus 2  \
      -v /var/lib/lxcfs/proc/cpuinfo:/proc/cpuinfo:rw \
      -v /var/lib/lxcfs/proc/diskstats:/proc/diskstats:rw \
      -v /var/lib/lxcfs/proc/meminfo:/proc/meminfo:rw \
      -v /var/lib/lxcfs/proc/stat:/proc/stat:rw \
      -v /var/lib/lxcfs/proc/swaps:/proc/swaps:rw \
      -v /var/lib/lxcfs/proc/uptime:/proc/uptime:rw \
      nginx:latest /bin/sh

4.4. Verifying CPU and memory inside the container

# cpu
grep -c processor /proc/cpuinfo
cat /proc/cpuinfo

# memory (use -m: with a 256m limit, free -g would round down to 0)
free -m
cat /proc/meminfo
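
The memory check can also be scripted. The sketch below parses the MemTotal line from the text of /proc/meminfo and converts it to MiB, so the result can be compared with the limit passed to docker run (256m in the example above). The function name mem_total_mib is hypothetical, and the sketch assumes the usual "MemTotal: <n> kB" line format.

```python
def mem_total_mib(meminfo_text: str) -> int:
    """Return MemTotal from /proc/meminfo content, in MiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            # Format: "MemTotal:  <value> kB"; the value is in KiB.
            kib = int(line.split()[1])
            return kib // 1024
    raise ValueError("MemTotal not found")


if __name__ == "__main__":
    # Inside the container this reads the file mounted from lxcfs.
    with open("/proc/meminfo") as f:
        print(mem_total_mib(f.read()), "MiB")
```

If lxcfs is mounted correctly, the printed value should match the container's memory limit rather than the host's total memory.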

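5. Deploying on a k8s cluster

Deploying on a k8s cluster follows the same idea as the systemd deployment. Two problems need to be solved:

  1. Run the lxcfs daemon on every node. lxcfs has to run from an image, so it can be deployed as a DaemonSet.
  2. Mount the directory maintained by lxcfs onto /proc inside Pods automatically.

For details, see: https://github.com/denverdino/lxcfs-admission-webhook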

5.1. lxcfs-image

Dockerfile

FROM centos:7 as build
RUN yum -y update
RUN yum -y install fuse-devel pam-devel wget gcc automake autoconf libtool make
ENV LXCFS_VERSION 3.1.2
RUN wget https://linuxcontainers.org/downloads/lxcfs/lxcfs-$LXCFS_VERSION.tar.gz && \
	mkdir /lxcfs && tar xzvf lxcfs-$LXCFS_VERSION.tar.gz -C /lxcfs  --strip-components=1 && \
	cd /lxcfs && ./configure && make

FROM centos:7
STOPSIGNAL SIGINT
COPY --from=build /lxcfs/lxcfs /usr/local/bin/lxcfs
COPY --from=build /lxcfs/.libs/liblxcfs.so /usr/local/lib/lxcfs/liblxcfs.so
COPY --from=build /lxcfs/lxcfs /lxcfs/lxcfs
COPY --from=build /lxcfs/.libs/liblxcfs.so /lxcfs/liblxcfs.so
COPY --from=build /usr/lib64/libfuse.so.2.9.2 /usr/lib64/libfuse.so.2.9.2
COPY --from=build /usr/lib64/libulockmgr.so.1.0.1 /usr/lib64/libulockmgr.so.1.0.1
RUN ln -s /usr/lib64/libfuse.so.2.9.2 /usr/lib64/libfuse.so.2 && \
    ln -s /usr/lib64/libulockmgr.so.1.0.1 /usr/lib64/libulockmgr.so.1
COPY start.sh /
CMD ["/start.sh"]

start.sh

#!/bin/bash

# Cleanup
nsenter -m/proc/1/ns/mnt fusermount -u /var/lib/lxcfs 2> /dev/null || true
nsenter -m/proc/1/ns/mnt [ -L /etc/mtab ] || \
        sed -i "/^lxcfs \/var\/lib\/lxcfs fuse.lxcfs/d" /etc/mtab

# Prepare
mkdir -p /usr/local/lib/lxcfs /var/lib/lxcfs

# Update lxcfs
cp -f /lxcfs/lxcfs /usr/local/bin/lxcfs
cp -f /lxcfs/liblxcfs.so /usr/local/lib/lxcfs/liblxcfs.so


# Mount
exec nsenter -m/proc/1/ns/mnt /usr/local/bin/lxcfs /var/lib/lxcfs/

5.2. daemonset

lxcfs-daemonset.yaml

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: lxcfs
  labels:
    app: lxcfs
spec:
  selector:
    matchLabels:
      app: lxcfs
  template:
    metadata:
      labels:
        app: lxcfs
    spec:
      hostPID: true
      tolerations:
      - key: node-role.kubernetes.io/master
        effect: NoSchedule
      containers:
      - name: lxcfs
        image: registry.cn-hangzhou.aliyuncs.com/denverdino/lxcfs:3.1.2
        imagePullPolicy: Always
        securityContext:
          privileged: true
        volumeMounts:
        - name: cgroup
          mountPath: /sys/fs/cgroup
        - name: lxcfs
          mountPath: /var/lib/lxcfs
          mountPropagation: Bidirectional
        - name: usr-local
          mountPath: /usr/local
      volumes:
      - name: cgroup
        hostPath:
          path: /sys/fs/cgroup
      - name: usr-local
        hostPath:
          path: /usr/local
      - name: lxcfs
        hostPath:
          path: /var/lib/lxcfs
          type: DirectoryOrCreate

5.3. lxcfs-admission-webhook

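lxcfs-admission-webhook implements a dynamic admission webhook, more precisely a mutating webhook: it watches Pod creation and patches the Pod, injecting the lxcfs-to-container directory mappings into the Pod spec so that the files are mounted automatically.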
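
To make the patching idea concrete, here is a Python sketch that builds a JSON Patch adding hostPath volumes and the matching volumeMounts for the /proc files used in section 4.3. It is not the code of lxcfs-admission-webhook: the function name lxcfs_patch and the volume names (lxcfs-proc-<file>) are hypothetical, and the host paths assume lxcfs is mounted at /var/lib/lxcfs.

```python
from typing import Dict, List

LXCFS_ROOT = "/var/lib/lxcfs"  # assumed lxcfs mount point on the host
PROC_FILES = ["cpuinfo", "diskstats", "meminfo", "stat", "swaps", "uptime"]


def lxcfs_patch(pod: Dict) -> List[Dict]:
    """Build JSON Patch operations that mount lxcfs files into every container."""
    spec = pod["spec"]
    ops: List[Dict] = []

    # JSON Patch cannot append to a missing array, so create it first.
    if "volumes" not in spec:
        ops.append({"op": "add", "path": "/spec/volumes", "value": []})
    for name in PROC_FILES:
        ops.append({
            "op": "add",
            "path": "/spec/volumes/-",
            "value": {
                "name": f"lxcfs-proc-{name}",
                "hostPath": {"path": f"{LXCFS_ROOT}/proc/{name}"},
            },
        })

    for i, ctr in enumerate(spec["containers"]):
        if "volumeMounts" not in ctr:
            ops.append({
                "op": "add",
                "path": f"/spec/containers/{i}/volumeMounts",
                "value": [],
            })
        for name in PROC_FILES:
            ops.append({
                "op": "add",
                "path": f"/spec/containers/{i}/volumeMounts/-",
                "value": {"name": f"lxcfs-proc-{name}", "mountPath": f"/proc/{name}"},
            })
    return ops
```

A mutating webhook would return such a patch, base64-encoded, in its admission response; the API server then applies it before the Pod is persisted.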

deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: lxcfs-admission-webhook-deployment
  labels:
    app: lxcfs-admission-webhook
spec:
  replicas: 1
  selector:
    matchLabels:
      app: lxcfs-admission-webhook
  template:
    metadata:
      labels:
        app: lxcfs-admission-webhook
    spec:
      containers:
        - name: lxcfs-admission-webhook
          image: registry.cn-hangzhou.aliyuncs.com/denverdino/lxcfs-admission-webhook:v1
          imagePullPolicy: IfNotPresent
          args:
            - -tlsCertFile=/etc/webhook/certs/cert.pem
            - -tlsKeyFile=/etc/webhook/certs/key.pem
            - -alsologtostderr
            - -v=4
            - 2>&1
          volumeMounts:
            - name: webhook-certs
              mountPath: /etc/webhook/certs
              readOnly: true
      volumes:
        - name: webhook-certs
          secret:
            secretName: lxcfs-admission-webhook-certs

For the full deployment steps, see install.sh:

#!/bin/bash

./deployment/webhook-create-signed-cert.sh
kubectl get secret lxcfs-admission-webhook-certs

kubectl create -f deployment/deployment.yaml
kubectl create -f deployment/service.yaml
cat ./deployment/mutatingwebhook.yaml | ./deployment/webhook-patch-ca-bundle.sh > ./deployment/mutatingwebhook-ca-bundle.yaml
kubectl create -f deployment/mutatingwebhook-ca-bundle.yaml

Run the script from the repository root:

./deployment/install.sh


8 - Operations Guide

8.1 - Troubleshooting a Kubernetes cluster

1. Viewing events

kubectl describe pod <PodName> --namespace=<NAMESPACE> 

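This command shows the Pod's configuration as defined at creation, its status, and its most recent events. The events are useful for troubleshooting. For example, when a Pod is Pending, the events usually point to one of these causes:

  • No Node is available for scheduling.
  • Resource quota management is enabled and the Pod's target node has no available resources.
  • The image is still being pulled (the pull takes too long) or the pull failed.

kubectl describe can also inspect other k8s objects: Node, RC, Service, Namespace, Secrets.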

1.1. Pod

kubectl describe pod <PodName> --namespace=<NAMESPACE>

The events below were produced by a container whose startup command is non-blocking: the container exits and k8s keeps restarting it.

kubectl describe pod <PodName> --namespace=<NAMESPACE>  

Events:
  FirstSeen LastSeen    Count   From            SubobjectPath       Reason      Message
  ───────── ────────    ─────   ────            ─────────────       ──────      ───────
  7m        7m      1   {scheduler }                    Scheduled   Successfully assigned yangsc-1-0-0-index0 to 10.8.216.19
  7m        7m      1   {kubelet 10.8.216.19}   containers{infra}   Pulled      Container image "gcr.io/kube-system/pause:0.8.0" already present on machine
  7m        7m      1   {kubelet 10.8.216.19}   containers{infra}   Created     Created with docker id 84f133c324d0
  7m        7m      1   {kubelet 10.8.216.19}   containers{infra}   Started     Started with docker id 84f133c324d0
  7m        7m      1   {kubelet 10.8.216.19}   containers{yangsc0} Started     Started with docker id 3f9f82abb145
  7m        7m      1   {kubelet 10.8.216.19}   containers{yangsc0} Created     Created with docker id 3f9f82abb145
  7m        7m      1   {kubelet 10.8.216.19}   containers{yangsc0} Created     Created with docker id fb112e4002f4
  7m        7m      1   {kubelet 10.8.216.19}   containers{yangsc0} Started     Started with docker id fb112e4002f4
  6m        6m      1   {kubelet 10.8.216.19}   containers{yangsc0} Created     Created with docker id 613b119d4474
  6m        6m      1   {kubelet 10.8.216.19}   containers{yangsc0} Started     Started with docker id 613b119d4474
  6m        6m      1   {kubelet 10.8.216.19}   containers{yangsc0} Created     Created with docker id 25cb68d1fd3d
  6m        6m      1   {kubelet 10.8.216.19}   containers{yangsc0} Started     Started with docker id 25cb68d1fd3d
  5m        5m      1   {kubelet 10.8.216.19}   containers{yangsc0} Started     Started with docker id 7d9ee8610b28
  5m        5m      1   {kubelet 10.8.216.19}   containers{yangsc0} Created     Created with docker id 7d9ee8610b28
  3m        3m      1   {kubelet 10.8.216.19}   containers{yangsc0} Started     Started with docker id 88b9e8d582dd
  3m        3m      1   {kubelet 10.8.216.19}   containers{yangsc0} Created     Created with docker id 88b9e8d582dd
  7m        1m      7   {kubelet 10.8.216.19}   containers{yangsc0} Pulling     Pulling image "gcr.io/test/tcp-hello:1.0.0"
  1m        1m      1   {kubelet 10.8.216.19}   containers{yangsc0} Started     Started with docker id 089abff050e7
  1m        1m      1   {kubelet 10.8.216.19}   containers{yangsc0} Created     Created with docker id 089abff050e7
  7m        1m      7   {kubelet 10.8.216.19}   containers{yangsc0} Pulled      Successfully pulled image "gcr.io/test/tcp-hello:1.0.0"
  6m        7s      34  {kubelet 10.8.216.19}   containers{yangsc0} Backoff     Back-off restarting failed docker container

1.2. NODE

kubectl describe node 10.8.216.20
[root@FC-43745A-10 ~]# kubectl describe node 10.8.216.20  
Name:           10.8.216.20  
Labels:         kubernetes.io/hostname=10.8.216.20,namespace/bcs-cc=true,namespace/myview=true  
CreationTimestamp:  Mon, 17 Apr 2017 11:32:52 +0800  
Phase:            
Conditions:  
  Type      Status  LastHeartbeatTime           LastTransitionTime          Reason              Message  
  ────      ──────  ─────────────────           ──────────────────          ──────              ───────  
  Ready     True    Fri, 18 Aug 2017 09:38:33 +0800     Tue, 02 May 2017 17:40:58 +0800     KubeletReady            kubelet is posting ready status  
  OutOfDisk     False   Fri, 18 Aug 2017 09:38:33 +0800     Mon, 17 Apr 2017 11:31:27 +0800     KubeletHasSufficientDisk    kubelet has sufficient disk space available  
Addresses:  10.8.216.20,10.8.216.20  
Capacity:  
 cpu:       32  
 memory:    67323039744  
 pods:      40  
System Info:  
 Machine ID:            723bafc7f6764022972b3eae1ce6b198  
 System UUID:           4C4C4544-0042-4210-8044-C3C04F595631  
 Boot ID:           da01f2e3-987a-425a-9ca7-1caaec35d1e5  
 Kernel Version:        3.10.0-327.28.3.el7.x86_64  
 OS Image:          CentOS Linux 7 (Core)  
 Container Runtime Version: docker://1.13.1  
 Kubelet Version:       v1.1.1-xxx2-13.1+79c90c68bfb72f-dirty  
 Kube-Proxy Version:        v1.1.1-xxx2-13.1+79c90c68bfb72f-dirty  
ExternalID:         10.8.216.20  
Non-terminated Pods:        (6 in total)  
  Namespace         Name                    CPU Requests    CPU Limits  Memory Requests Memory Limits  
  ─────────         ────                    ────────────    ──────────  ─────────────── ─────────────  
  bcs-cc            bcs-cc-api-0-0-1364-index0      1 (3%)      1 (3%)      4294967296 (6%) 4294967296 (6%)  
  bcs-cc            bcs-cc-api-0-0-1444-index0      1 (3%)      1 (3%)      4294967296 (6%) 4294967296 (6%)  
  fw                fw-demo2-0-0-1519-index0        1 (3%)      1 (3%)      4294967296 (6%) 4294967296 (6%)  
  myview            myview-api-0-0-1362-index0      1 (3%)      1 (3%)      4294967296 (6%) 4294967296 (6%)  
  myview            myview-api-0-0-1442-index0      1 (3%)      1 (3%)      4294967296 (6%) 4294967296 (6%)  
  qa-ts-dna         ts-dna-console3-0-0-1434-index0     1 (3%)      1 (3%)      4294967296 (6%) 4294967296 (6%)  
Allocated resources:  
  (Total limits may be over 100%, i.e., overcommitted. More info: http://releases.k8s.io/HEAD/docs/user-guide/compute-resources.md)  
  CPU Requests  CPU Limits  Memory Requests     Memory Limits  
  ────────────  ──────────  ───────────────     ─────────────  
  6 (18%)   6 (18%)     25769803776 (38%)   25769803776 (38%)  
No events.  
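
The percentages in the output above can be reproduced with simple integer arithmetic. The sketch below assumes the percentage is the used amount divided by the node capacity, truncated to a whole number; that assumption matches every figure shown here (for example 6 of 32 CPUs gives 18%). The function name percent_of_capacity is hypothetical.

```python
def percent_of_capacity(used: int, capacity: int) -> int:
    """Return used/capacity as a whole-number percentage, truncated."""
    return used * 100 // capacity


# Capacity taken from the node output above.
CPU_CAPACITY = 32
MEMORY_CAPACITY = 67323039744

print(percent_of_capacity(6, CPU_CAPACITY))                 # total CPU requests
print(percent_of_capacity(25769803776, MEMORY_CAPACITY))    # total memory requests
```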

1.3. RC

kubectl describe rc mytest-1-0-0 --namespace=test
[root@FC-43745A-10 ~]# kubectl describe rc mytest-1-0-0 --namespace=test  
Name:       mytest-1-0-0  
Namespace:  test  
Image(s):   gcr.io/test/mywebcalculator:1.0.1  
Selector:   app=mytest,appVersion=1.0.0  
Labels:     app=mytest,appVersion=1.0.0,env=ts,zone=inner  
Replicas:   1 current / 1 desired  
Pods Status:    1 Running / 0 Waiting / 0 Succeeded / 0 Failed  
No volumes.  
Events:  
  FirstSeen LastSeen    Count   From                SubobjectPath   Reason          Message  
  ───────── ────────    ─────   ────                ─────────────   ──────          ───────  
  20h       19h     9   {replication-controller }           FailedCreate        Error creating: Pod "mytest-1-0-0-index0" is forbidden: limited to 10 pods  
  20h       17h     7   {replication-controller }           FailedCreate        Error creating: pods "mytest-1-0-0-index0" already exists  
  20h       17h     4   {replication-controller }           SuccessfulCreate    Created pod: mytest-1-0-0-index0  

1.4. NAMESPACE

kubectl describe namespace test
[root@FC-43745A-10 ~]# kubectl describe namespace test  
Name:   test  
Labels: <none>  
Status: Active  
  
Resource Quotas  
 Resource       Used        Hard  
 ---            ---     ---  
 cpu            5       20  
 memory         1342177280  53687091200  
 persistentvolumeclaims 0       10  
 pods           4       10  
 replicationcontrollers 8       20  
 resourcequotas     1       1  
 secrets        3       10  
 services       8       20  
  
No resource limits.  

1.5. Service

kubectl describe service xxx-containers-1-1-0 --namespace=test
[root@FC-43745A-10 ~]# kubectl describe service xxx-containers-1-1-0 --namespace=test  
Name:           xxx-containers-1-1-0  
Namespace:      test  
Labels:         app=xxx-containers,appVersion=1.1.0,env=ts,zone=inner  
Selector:       app=xxx-containers,appVersion=1.1.0  
Type:           ClusterIP  
IP:         10.254.46.42  
Port:           port-dna-tcp-35913  35913/TCP  
Endpoints:      10.0.92.17:35913  
Port:           port-l7-tcp-8080    8080/TCP  
Endpoints:      10.0.92.17:8080  
Session Affinity:   None  
No events.  

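2. Viewing container logs

1) View the logs of a given Pod

kubectl logs <pod_name>

kubectl logs -f <pod_name> # follow the log, similar to tail -f

2) View the logs of the previous (terminated) container instance in a Pod

kubectl logs -p <pod_name>

3) View the logs of a specific container in a Pod

kubectl logs <pod_name> -c <container_name>

4) kubectl logs --help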

[root@node5 ~]# kubectl logs --help  
Print the logs for a container in a pod. If the pod has only one container, the container name is optional.  
Usage:  
  kubectl logs [-f] [-p] POD [-c CONTAINER] [flags]  
Aliases:  
  logs, log  
   
Examples:  
# Return snapshot logs from pod nginx with only one container  
$ kubectl logs nginx  
# Return snapshot of previous terminated ruby container logs from pod web-1  
$ kubectl logs -p -c ruby web-1  
# Begin streaming the logs of the ruby container in pod web-1  
$ kubectl logs -f -c ruby web-1  
# Display only the most recent 20 lines of output in pod nginx  
$ kubectl logs --tail=20 nginx  
# Show all logs from pod nginx written in the last hour  
$ kubectl logs --since=1h nginx  

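3. Viewing k8s service logs

3.1. journalctl

On Linux, Kubernetes services are typically managed by systemd, and the journal captures their output. Use systemctl status <service> or journalctl -u <service> -f to view the logs of a Kubernetes service.

The Kubernetes components are:

Component                 Log content                                 Notes
kube-apiserver
kube-controller-manager   Pod scaling or RC related
kube-scheduler            Pod scaling or RC related
kubelet                   Pod lifecycle: creation, stopping, etc.
etcd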

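3.2. Log files

Logs can also be written to, and read from, a dedicated log directory:

  • --logtostderr=false: do not log to stderr
  • --log-dir=/var/log/kubernetes: directory where log files are stored
  • --alsologtostderr=false: when set to true, logs go to stderr as well as to files
  • --v=0: glog log level
  • --vmodule=gfs*=2,test*=4: glog per-module verbosity levels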

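4. Common problems

4.1. Pod stays in the Pending state

kubectl describe pod <pod_name> --namespace=<NAMESPACE>

Check the Pod's events. Typical causes:

  • The image is being pulled but the pull does not complete (it takes too long). This is the most common cause.
  • No Node is available for scheduling.
  • Resource quota management is enabled and the Pod's target node has no available resources.

Fixes:

  1. Check the network between the Pod's host and the image registry; try pulling the image manually.
  2. Delete the Pod instance so that it is rescheduled onto another host.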

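4.2. Pod keeps restarting after creation

In the output of kubectl get pods, the Pod status alternates between Running and other states, and the RESTARTS count keeps increasing.

The usual cause is that the container's startup command is not a blocking (foreground) command, so the container exits right after it starts.

Typical non-blocking commands:

  • The command specified by CMD is itself non-blocking.
  • The service is started in the background.

Fixes:

1) Change the command to a blocking one that runs in the foreground, for example: zkServer.sh start-foreground

2) In the startup script of a Java program, remove nohup and & from nohup xxx &. For example, change:

nohup $JAVA_HOME/bin/java $JAVA_OPTS -cp $CLASSPATH com.cnc.open.processor.Main &

to:

$JAVA_HOME/bin/java $JAVA_OPTS -cp $CLASSPATH com.cnc.open.processor.Main

Reference: 《Kubernetes权威指南》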

8.2 - kubectl

8.2.1 - Installing and configuring kubectl

1. Installing kubectl

curl -LO https://storage.googleapis.com/kubernetes-release/release/$(curl -s https://storage.googleapis.com/kubernetes-release/release/stable.txt)/bin/linux/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/

To install a specific version of kubectl, for example v1.9.0:

curl -LO https://storage.googleapis.com/kubernetes-release/release/v1.9.0/bin/linux/amd64/kubectl && chmod +x kubectl && sudo mv kubectl /usr/local/bin/

2. Configuring access to a k8s cluster

2.1. From the command line

2.1.1 Insecure (HTTP) access

kubectl config set-cluster k8s --server=http://<url> 
kubectl config set-context <NAMESPACE> --cluster=k8s --namespace=<NAMESPACE> 

kubectl config use-context <NAMESPACE> 

2.1.2 Secure (HTTPS) access

kubectl config set-cluster k8s --server=https://<url> --insecure-skip-tls-verify=true
kubectl config set-credentials k8s-user --username=<username> --password=<password>

kubectl config set-context <NAMESPACE> --cluster=k8s --user=k8s-user --namespace=<NAMESPACE> 
kubectl config use-context <NAMESPACE>

2.1.3 Checking the current configuration

[root@test ]# kubectl cluster-info
Kubernetes master is running at http://192.168.10.3:8081

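2.2. With a configuration file

When neither the --kubeconfig flag nor the $KUBECONFIG environment variable is set, kubectl reads ${HOME}/.kube/config by default.

So create the file ${HOME}/.kube/config, and create ca.pem, cert.pem, and key.pem under the ${HOME}/.kube/ssl directory.

The content of the config file is as follows: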

apiVersion: v1
kind: Config
clusters:
- name: local
  cluster:
    certificate-authority: ./ssl/ca.pem
    server: https://192.168.10.3:6443
users:
- name: kubelet
  user:
    client-certificate: ./ssl/cert.pem
    client-key: ./ssl/key.pem
contexts:
- context:
    cluster: local
    user: kubelet
  name: kubelet-cluster.local
current-context: kubelet-cluster.local

3. kubectl config

Help output of the kubectl config command:

$ kubectl config --help
Modify kubeconfig files using subcommands like "kubectl config set current-context my-context"

The loading order follows these rules:

  1. If the --kubeconfig flag is set, then only that file is loaded.  The flag may only be set once and no merging takes
place.
  2. If $KUBECONFIG environment variable is set, then it is used a list of paths (normal path delimitting rules for your
system).  These paths are merged.  When a value is modified, it is modified in the file that defines the stanza.  When a
value is created, it is created in the first file that exists.  If no files in the chain exist, then it creates the last
file in the list.
  3. Otherwise, ${HOME}/.kube/config is used and no merging takes place.

Available Commands:
  current-context Displays the current-context
  delete-cluster  Delete the specified cluster from the kubeconfig
  delete-context  Delete the specified context from the kubeconfig
  get-clusters    Display clusters defined in the kubeconfig
  get-contexts    Describe one or many contexts
  rename-context  Renames a context from the kubeconfig file.
  set             Sets an individual value in a kubeconfig file
  set-cluster     Sets a cluster entry in kubeconfig
  set-context     Sets a context entry in kubeconfig
  set-credentials Sets a user entry in kubeconfig
  unset           Unsets an individual value in a kubeconfig file
  use-context     Sets the current-context in a kubeconfig file
  view            Display merged kubeconfig settings or a specified kubeconfig file

Usage:
  kubectl config SUBCOMMAND [options]

Use "kubectl <command> --help" for more information about a given command.
Use "kubectl options" for a list of global command-line options (applies to all commands).
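
The loading order described in the help text can be sketched as a small Python function. This only illustrates which files would be considered; it does not perform the merging that kubectl does, and the function name kubeconfig_paths is hypothetical.

```python
import os
from typing import List, Mapping, Optional


def kubeconfig_paths(flag: Optional[str], env: Mapping[str, str], home: str) -> List[str]:
    """Return the kubeconfig files to load, following the three rules."""
    # Rule 1: --kubeconfig wins and only that file is loaded.
    if flag:
        return [flag]
    # Rule 2: $KUBECONFIG is a path list using the system's path delimiter.
    env_value = env.get("KUBECONFIG", "")
    if env_value:
        return [p for p in env_value.split(os.pathsep) if p]
    # Rule 3: fall back to ${HOME}/.kube/config.
    return [os.path.join(home, ".kube", "config")]


if __name__ == "__main__":
    print(kubeconfig_paths(None, os.environ, os.path.expanduser("~")))
```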

4. Shell completion

source <(kubectl completion bash)
echo "source <(kubectl completion bash)" >> ~/.bashrc

If the following error appears:

# kubectl completion fails
kubectl _get_comp_words_by_ref : command not found

Fix:

yum install bash-completion -y

source /etc/profile.d/bash_completion.sh 


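8.2.2 - kubectl command reference

1. Introduction to kubectl commands

kubectl command syntax

kubectl [command] [TYPE] [NAME] [flags]

where command, TYPE, NAME, and flags are:

  • command: the operation to perform on one or more resources, for example create, get, describe, delete.

  • TYPE: the resource type. Resource types are case-insensitive, and you can use the singular, plural, or abbreviated form. For example, the following commands produce the same output: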

    kubectl get pod pod1  
    kubectl get pods pod1 
    kubectl get po pod1
    
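  • NAME: the name of the resource. Names are case-sensitive. If the name is omitted, details for all resources of that type are displayed, for example $ kubectl get pods.

    To specify multiple resources by type and name:

    * To group resources that are all of the same type: TYPE1 name1 name2 name<#>
      Example: $ kubectl get pod example-pod1 example-pod2

    * To specify multiple resource types individually: TYPE1/name1 TYPE1/name2 TYPE2/name3 TYPE<#>/name<#>
      Example: $ kubectl get pod/example-pod1 replicationcontroller/example-rc1

  • flags: optional flags. For example, you can use the -s or --server flag to specify the address and port of the Kubernetes API server.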

Full list of commands:

[root@node5 ~]# kubectl
kubectl controls the Kubernetes cluster manager.

Find more information at https://github.com/kubernetes/kubernetes.

Basic Commands (Beginner):
  create         Create a resource from a file or from stdin.
  expose         Take a replication controller, service, deployment or pod and expose it as a new Kubernetes Service
  run            Run a particular image on the cluster
  set            Set specific features on objects
  run-container  Run a particular image on the cluster. This command is deprecated, use "run" instead

Basic Commands (Intermediate):
  get            Display one or many resources
  explain        Documentation of resources
  edit           Edit a resource on the server
  delete         Delete resources by filenames, stdin, resources and names, or by resources and label selector

Deploy Commands:
  rollout        Manage the rollout of a resource
  rolling-update Perform a rolling update of the given ReplicationController
  scale          Set a new size for a Deployment, ReplicaSet, Replication Controller, or Job
  autoscale      Auto-scale a Deployment, ReplicaSet, or ReplicationController

Cluster Management Commands:
  certificate    Modify certificate resources.
  cluster-info   Display cluster info
  top            Display Resource (CPU/Memory/Storage) usage.
  cordon         Mark node as unschedulable
  uncordon       Mark node as schedulable
  drain          Drain node in preparation for maintenance
  taint          Update the taints on one or more nodes

Troubleshooting and Debugging Commands:
  describe       Show details of a specific resource or group of resources
  logs           Print the logs for a container in a pod
  attach         Attach to a running container
  exec           Execute a command in a container
  port-forward   Forward one or more local ports to a pod
  proxy          Run a proxy to the Kubernetes API server
  cp             Copy files and directories to and from containers.
  auth           Inspect authorization

Advanced Commands:
  apply          Apply a configuration to a resource by filename or stdin
  patch          Update field(s) of a resource using strategic merge patch
  replace        Replace a resource by filename or stdin
  convert        Convert config files between different API versions

Settings Commands:
  label          Update the labels on a resource
  annotate       Update the annotations on a resource
  completion     Output shell completion code for the specified shell (bash or zsh)

Other Commands:
  api-versions   Print the supported API versions on the server, in the form of "group/version"
  config         Modify kubeconfig files
  help           Help about any command
  plugin         Runs a command-line plugin
  version        Print the client and server version information

Use "kubectl <command> --help" for more information about a given command.
Use "kubectl options" for a list of global command-line options (applies to all commands).

2. Commonly used resource objects

  1. Node
  2. Pods
  3. Replication Controllers
  4. Services
  5. Namespace
  6. Deployment
  7. StatefulSet

Resource types and their abbreviations:

  * all
  * certificatesigningrequests (aka 'csr')
  * clusterrolebindings
  * clusterroles
  * componentstatuses (aka 'cs')
  * configmaps (aka 'cm')
  * controllerrevisions
  * cronjobs
  * customresourcedefinition (aka 'crd')
  * daemonsets (aka 'ds')
  * deployments (aka 'deploy')
  * endpoints (aka 'ep')
  * events (aka 'ev')
  * horizontalpodautoscalers (aka 'hpa')
  * ingresses (aka 'ing')
  * jobs
  * limitranges (aka 'limits')
  * namespaces (aka 'ns')
  * networkpolicies (aka 'netpol')
  * nodes (aka 'no')
  * persistentvolumeclaims (aka 'pvc')
  * persistentvolumes (aka 'pv')
  * poddisruptionbudgets (aka 'pdb')
  * podpreset
  * pods (aka 'po')
  * podsecuritypolicies (aka 'psp')
  * podtemplates
  * replicasets (aka 'rs')
  * replicationcontrollers (aka 'rc')
  * resourcequotas (aka 'quota')
  * rolebindings
  * roles
  * secrets
  * serviceaccounts (aka 'sa')
  * services (aka 'svc')
  * statefulsets (aka 'sts')
  * storageclasses (aka 'sc')

3. kubectl commands by category [command]

3.1 Create

1) create: [Create a resource by filename or stdin]

2) run: [Run a particular image on the cluster]

3) apply: [Apply a configuration to a resource by filename or stdin]

4) proxy: [Run a proxy to the Kubernetes API server]

3.2 Delete

1) delete: [Delete resources]

3.3 Update

1) scale: [Set a new size for a Replication Controller]

2) exec: [Execute a command in a container]

3) attach: [Attach to a running container]

4) patch: [Update field(s) of a resource by stdin]

5) edit: [Edit a resource on the server]

6) label: [Update the labels on a resource]

7) annotate: [Update the annotations on a resource]

8) replace: [Replace a resource by filename or stdin]

9) config: [config modifies kubeconfig files]

3.4 Query

1) get: [Display one or many resources]

2) describe: [Show details of a specific resource or group of resources]

3) logs: [Print the logs for a container in a pod]

4) cluster-info: [Display cluster info]

5) version: [Print the client and server version information]

6) api-versions: [Print the supported API versions]

4. Pod commands

4.1 Listing Pods

kubectl get pod -o wide --namespace=<NAMESPACE>

4.2 Opening a shell in a Pod

kubectl exec -it <PodName> --namespace=<NAMESPACE> -- /bin/bash

# Open a shell in a specific container of the Pod
kubectl exec -it <PodName> -c <ContainerName> --namespace=<NAMESPACE> -- /bin/bash

4.3 Deleting Pods

kubectl delete pod <PodName> --namespace=<NAMESPACE>

# Force-delete a Pod that is stuck in the Terminating state
kubectl delete pod <PodName> --namespace=<NAMESPACE> --force --grace-period=0

# Delete all objects of one type in a namespace
kubectl delete deploy --all --namespace=test

4.4 Viewing logs

# View the logs of the running container
kubectl logs <PodName> --namespace=<NAMESPACE>
# View the logs of the previous, terminated container
kubectl logs <PodName> -p --namespace=<NAMESPACE>

5. Common commands

5.1. Cordoning and uncordoning a Node

Note: after a Node is cordoned, Pods already running on it are not affected; new Pods are no longer scheduled onto it.

1. Cordon a Node

# cordon command
kubectl cordon <NodeName>
# or
kubectl patch node <NodeName> -p '{"spec":{"unschedulable":true}}'

2. Uncordon a Node

# uncordon
kubectl uncordon <NodeName>
# or
kubectl patch node <NodeName> -p '{"spec":{"unschedulable":false}}'

5.2. kubectl label

1. Label a node so that Pods can be pinned to it

kubectl label node <NodeName> namespace/<NAMESPACE>=true

2. Remove the label so that Pods are no longer pinned to the node

kubectl label node <NodeName> namespace/<NAMESPACE>-

5.3. Updating an image

# Update the image
kubectl set image deployment/nginx nginx=nginx:1.15.12 -n nginx
# Watch the rolling update
kubectl rollout status deployment/nginx  -n nginx

5.4. Adjusting resource values

# Adjust the resources of a specific container
kubectl set resources sts nginx-0 -c=agent --limits=memory=512Mi -n nginx

5.5. Adjusting the readiness probe

# List the readiness probe timeoutSeconds of all StatefulSets
kubectl get statefulset -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[0].readinessProbe.timeoutSeconds}{"\n"}{end}'

# Change the readiness probe timeoutSeconds
kubectl patch statefulset nginx-sts --type='json' -p='[{"op": "replace", "path": "/spec/template/spec/containers/0/readinessProbe/timeoutSeconds", "value":5}]' -n nginx

5.6. Adjusting tolerations

kubectl patch statefulset nginx-sts --patch '{"spec": {"template": {"spec": {"tolerations": [{"effect": "NoSchedule","key": "dedicated","operator": "Equal","value": "nginx"}]}}}}' -n nginx

5.7. Listing the IP of every node

kubectl get nodes -o=jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.addresses[0].address}{"\n"}{end}'
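
The jsonpath expression above prints the first entry of status.addresses, which may not always be the internal IP. As an alternative, the Python sketch below selects the entry whose type is InternalIP from the JSON produced by kubectl get nodes -o json. The function name node_internal_ips is hypothetical; the field names follow the Node API object.

```python
import json
import sys
from typing import Dict


def node_internal_ips(node_list: dict) -> Dict[str, str]:
    """Map each node name to its first address of type InternalIP."""
    result: Dict[str, str] = {}
    for node in node_list.get("items", []):
        for addr in node.get("status", {}).get("addresses", []):
            if addr.get("type") == "InternalIP":
                result[node["metadata"]["name"]] = addr["address"]
                break
    return result


if __name__ == "__main__":
    # Usage: kubectl get nodes -o json | python node_ips.py
    for name, ip in node_internal_ips(json.load(sys.stdin)).items():
        print(f"{name}\t{ip}")
```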

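5.8. Finding the current leader of a k8s component

When a k8s cluster is deployed for high availability, only one instance each of kube-controller-manager and kube-scheduler actively runs its logic at a time; leader election is enabled with --leader-elect=true. The following command queries the leader node.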

$ kubectl get endpoints kube-controller-manager --namespace=kube-system  -o yaml

apiVersion: v1
kind: Endpoints
metadata:
  annotations:
    control-plane.alpha.kubernetes.io/leader: '{"holderIdentity":"xxx.xxx.xxx.xxx_6537b938-7f5a-11e9-8487-00220d338975","leaseDurationSeconds":15,"acquireTime":"2019-05-26T02:03:18Z","renewTime":"2019-05-26T02:06:08Z","leaderTransitions":1}'
  creationTimestamp: "2019-05-26T01:52:39Z"
  name: kube-controller-manager
  namespace: kube-system
  resourceVersion: "1965"
  selfLink: /api/v1/namespaces/kube-system/endpoints/kube-controller-manager
  uid: f1755fc5-7f58-11e9-b4c4-00220d338975

In the output above, the holderIdentity value (xxx.xxx.xxx.xxx followed by an ID) identifies the leader node of kube-controller-manager.

The leader node of kube-scheduler can be found in the same way:

kubectl get endpoints kube-scheduler --namespace=kube-system  -o yaml
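
The leader annotation is itself a JSON string, so it can be decoded programmatically. The Python sketch below extracts holderIdentity from an Endpoints object as returned by -o json; the function name leader_identity is hypothetical, and the annotation key is the one shown in the output above.

```python
import json

LEADER_ANNOTATION = "control-plane.alpha.kubernetes.io/leader"


def leader_identity(endpoints: dict) -> str:
    """Return the holderIdentity stored in the leader-election annotation."""
    raw = endpoints["metadata"]["annotations"][LEADER_ANNOTATION]
    # The annotation value is a JSON document encoded as a string.
    return json.loads(raw)["holderIdentity"]
```

The part of holderIdentity before the last underscore is the host that currently holds the lock.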

5.9. Changing the replica count

kubectl scale deployment.v1.apps/nginx-deployment --replicas=10

5.10. Bulk-deleting Pods

kubectl get po -n default |grep Evicted |awk '{print $1}' |xargs -I {} kubectl delete po  {} -n default
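
The pipeline above matches "Evicted" anywhere on the line. A stricter variant, sketched here in Python, checks only the STATUS column of the kubectl get po output. The function name evicted_pods is hypothetical, and the sketch assumes the default column layout (NAME, READY, STATUS, ...).

```python
from typing import List


def evicted_pods(get_po_output: str) -> List[str]:
    """Return the names of Pods whose STATUS column is exactly 'Evicted'."""
    names = []
    # Skip the header line, then split each row on whitespace.
    for line in get_po_output.splitlines()[1:]:
        cols = line.split()
        if len(cols) >= 3 and cols[2] == "Evicted":
            names.append(cols[0])
    return names
```

The returned names can then be passed to kubectl delete po, as in the command above.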

5.11. Miscellaneous query commands

# Print a decoded Secret without external tools
kubectl get secret my-secret -o go-template='{{range $k,$v := .data}}{{"### "}}{{$k}}{{"\n"}}{{$v|base64decode}}{{"\n\n"}}{{end}}'

# List events, sorted by timestamp
kubectl get events --sort-by=.metadata.creationTimestamp
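
For comparison with the go-template above, the same decoding can be done in Python on the .data map of a Secret (for example from kubectl get secret my-secret -o json). The function name decode_secret_data is hypothetical, and the sketch assumes the values are UTF-8 text.

```python
import base64
from typing import Dict


def decode_secret_data(data: Dict[str, str]) -> Dict[str, str]:
    """Base64-decode every value in a Secret's .data map."""
    return {key: base64.b64decode(value).decode("utf-8") for key, value in data.items()}
```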

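6. kubectl log verbosity

The verbosity of kubectl output is controlled with the -v or --v flag followed by an integer log level. General Kubernetes logging conventions and the associated log levels are described in the Kubernetes logging conventions document.

Verbosity  Description
--v=0      Generally useful information that should always be visible to operators.
--v=1      A reasonable default log level if you do not want verbose output.
--v=2      Steady-state information about the service and important log messages that may correlate with significant changes in the system. This is the recommended default log level for most systems.
--v=3      Extended information about changes in system state.
--v=4      Debug-level verbosity.
--v=5      Trace-level verbosity.
--v=6      Display requested resources.
--v=7      Display HTTP request headers.
--v=8      Display HTTP request contents.
--v=9      Display HTTP request contents without truncation.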

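8.2.3 - kubectl command aliases

1. kubectl-aliases

The open-source kubectl-aliases tool is a list of shell aliases generated by a script that concatenates kubectl elements. The elements that make up an alias are: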

Element       Values
base          kubectl
[system?]     -n=kube-system
[operation]   get, describe, rm:delete, logs, exec, apply
[resource]    pods, deployment, secret, ingress, node, svc, ns, cm
[flags]       oyaml, ojson, owide, all, watch, file, l
  • k=kubectl
    • sys=--namespace kube-system
  • commands:
    • g=get
    • d=describe
    • rm=delete
    • a=apply -f
    • ex=exec -i -t
    • lo=logs -f
  • resources:
    • po=pod
    • dep=deployment
    • ing=ingress
    • svc=service
    • cm=configmap
    • sec=secret
    • ns=namespace
    • no=node
  • flags:
    • output format: oyaml, ojson, owide
    • all: --all or --all-namespaces depending on the command
    • sl: --show-labels
    • w=-w/--watch
  • value flags (should be at the end):
    • f=-f/--filename
    • l=-l/--selector

2. Examples

# Example 1
kd → kubectl describe

# Example 2
kgdepallw → kubectl get deployment --all-namespaces --watch

Examples of get aliases:

alias k='kubectl'
alias kg='kubectl get'
alias kgpo='kubectl get pods'
alias kgpoojson='kubectl get pods -o=json'
alias kgpon='kubectl get pods --namespace'
alias ksysgpooyamll='kubectl --namespace=kube-system get pods -o=yaml -l'
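How an alias name decomposes into elements can be sketched as a greedy prefix parser. This is an illustration only, not the generator script of kubectl-aliases: the token tables are copied from the element list above (plus n=--namespace, taken from the kgpon example), "all" is always expanded to --all-namespaces, and the names token, take and expand are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
)

// token maps an abbreviation to the text it stands for.
type token struct{ short, long string }

// Token tables, taken from the element list and alias examples above.
var (
	system    = []token{{"sys", "--namespace=kube-system"}}
	commands  = []token{{"rm", "delete"}, {"ex", "exec -i -t"}, {"lo", "logs -f"}, {"g", "get"}, {"d", "describe"}, {"a", "apply -f"}}
	resources = []token{{"dep", "deployment"}, {"ing", "ingress"}, {"svc", "service"}, {"sec", "secret"}, {"po", "pods"}, {"cm", "configmap"}, {"ns", "namespace"}, {"no", "node"}}
	flags     = []token{{"oyaml", "-o=yaml"}, {"ojson", "-o=json"}, {"owide", "-o=wide"}, {"all", "--all-namespaces"}, {"sl", "--show-labels"}, {"w", "--watch"}, {"f", "-f"}, {"l", "-l"}, {"n", "--namespace"}}
)

// take strips the first abbreviation of the group that prefixes rest, if any.
func take(rest string, group []token) (long, remaining string, ok bool) {
	for _, t := range group {
		if strings.HasPrefix(rest, t.short) {
			return t.long, rest[len(t.short):], true
		}
	}
	return "", rest, false
}

// expand turns an alias name such as "kgdepallw" into the command it stands for.
func expand(alias string) (string, error) {
	if !strings.HasPrefix(alias, "k") {
		return "", fmt.Errorf("alias %q does not start with k", alias)
	}
	parts := []string{"kubectl"}
	rest := alias[1:]
	// The system, operation and resource elements each appear at most once, in this order.
	for _, group := range [][]token{system, commands, resources} {
		if long, remaining, ok := take(rest, group); ok {
			parts, rest = append(parts, long), remaining
		}
	}
	// Any number of flags may follow.
	for rest != "" {
		long, remaining, ok := take(rest, flags)
		if !ok {
			return "", fmt.Errorf("unrecognised suffix %q in alias %q", rest, alias)
		}
		parts, rest = append(parts, long), remaining
	}
	return strings.Join(parts, " "), nil
}

func main() {
	for _, a := range []string{"kd", "kgdepallw", "ksysgpooyamll"} {
		cmd, err := expand(a)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> %s\n", a, cmd)
	}
}
```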

3. Installation

# Download .kubectl_aliases into the home directory
cd ~ && wget https://raw.githubusercontent.com/ahmetb/kubectl-aliases/master/.kubectl_aliases

# Add the following to .bashrc, then run: source ~/.bashrc
[ -f ~/.kubectl_aliases ] && source ~/.kubectl_aliases
function kubectl() { command kubectl "$@"; }

# To print the full command behind each alias, add the following to .bashrc instead, then run: source ~/.bashrc
[ -f ~/.kubectl_aliases ] && source ~/.kubectl_aliases
function kubectl() { echo "+ kubectl $*"; command kubectl "$@"; }

8.3 - Node scheduling

8.3.1 - Scheduling to specific nodes and isolating nodes

1. NodeSelector

1.1. Concept

To restrict a Pod to specific Nodes, label the Nodes and set a nodeSelector on the Pod.

1.2. Usage

1.2.1. Label the Node

# Get the node names
kubectl get nodes

# Set a label
kubectl label nodes <node-name> <label-key>=<label-value>
# For example
kubectl label nodes node-1 disktype=ssd

# Show the labels of the Nodes
kubectl get nodes --show-labels

# Remove a label from a Node
kubectl label node <node-name> <label-key>-

1.2.2. Set a nodeSelector on the Pod

apiVersion: v1
kind: Pod
metadata:
  name: nginx
  labels:
    env: test
spec:
  containers:
  - name: nginx
    image: nginx
    imagePullPolicy: IfNotPresent
  nodeSelector:
    disktype: ssd    # matches the Node label
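The selection rule behind nodeSelector is that every key/value pair in the selector must be present in the Node's labels. A minimal sketch of that check, not the scheduler's actual code; matchesNodeSelector and the node label values are hypothetical names chosen for illustration.

```go
package main

import "fmt"

// matchesNodeSelector reports whether every key/value pair of the selector
// is present, with the same value, in the node's labels.
func matchesNodeSelector(nodeLabels, selector map[string]string) bool {
	for key, want := range selector {
		if got, ok := nodeLabels[key]; !ok || got != want {
			return false
		}
	}
	return true
}

func main() {
	selector := map[string]string{"disktype": "ssd"}

	// Hypothetical node labels.
	ssdNode := map[string]string{"kubernetes.io/hostname": "node-1", "disktype": "ssd"}
	hddNode := map[string]string{"kubernetes.io/hostname": "node-2", "disktype": "hdd"}

	fmt.Println("node-1 eligible:", matchesNodeSelector(ssdNode, selector))
	fmt.Println("node-2 eligible:", matchesNodeSelector(hddNode, selector))
}
```

An empty selector matches every node, which corresponds to a Pod without a nodeSelector.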

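1.3. Affinity and anti-affinity

To be added.

2. Taint and Toleration

2.1. Concept

A nodeSelector attracts Pods to specific Nodes through labels. A taint does the opposite: it lets a node repel a class of Pods unless the Pod declares a matching toleration. (Taints are applied to Nodes; only Pods that tolerate those taints may be scheduled onto the Node.)

2.2. Usage

2.2.1. kubectl taint

# Add a taint to a node: the key is <key>, the value is <value> and the effect is NoSchedule.
kubectl taint nodes <node_name> <key>=<value>:NoSchedule

Only a pod with a toleration that matches this taint can be scheduled onto the node node_name.

For example, define the pod's toleration in the PodSpec in either of these two forms: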

tolerations:
- key: "key"
  operator: "Equal"
  value: "value"
  effect: "NoSchedule"
# or
tolerations:
- key: "key"
  operator: "Exists"
  effect: "NoSchedule"

2.2.2. Matching rules

A toleration "matches" a taint when both have the same key and effect, and:

  • the operator is Exists (in which case the toleration must not specify a value), or
  • the operator is Equal and the values are equal.

Special cases:

  • A toleration with an empty key and operator Exists matches every key, value and effect, so it tolerates any taint.

    tolerations:
    - operator: "Exists"
    
  • A toleration with an empty effect matches taints with the same key regardless of their effect.

    tolerations:
    - key: "key"
      operator: "Exists"
    

A node can carry several taints and a pod several tolerations. Kubernetes processes them like a filter: start with all of the node's taints, drop those for which the pod has a matching toleration, and let the effects of the remaining taints decide whether the pod can be placed on the node. In particular:

  • If at least one remaining taint has effect NoSchedule, Kubernetes will not schedule the pod onto the node.
  • If no remaining taint has effect NoSchedule but at least one has effect PreferNoSchedule, Kubernetes will try not to schedule the pod onto the node.
  • If at least one remaining taint has effect NoExecute, Kubernetes will not schedule the pod onto the node (if it is not yet running there) and will evict it from the node (if it is already running there).
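The matching rules and the filtering step can be sketched together in a few lines. This is an illustration under stated assumptions, not the scheduler's code: the Taint and Toleration types are reduced to plain strings, an empty operator is treated as Equal, and tolerates, untolerated and outcome are hypothetical helper names.

```go
package main

import "fmt"

// Reduced stand-ins for the Kubernetes taint and toleration objects.
type Taint struct{ Key, Value, Effect string }

type Toleration struct{ Key, Operator, Value, Effect string }

// tolerates applies the matching rules: an empty effect matches any effect,
// an empty key matches any key, Exists ignores the value, Equal compares it.
func tolerates(tol Toleration, t Taint) bool {
	if tol.Effect != "" && tol.Effect != t.Effect {
		return false
	}
	if tol.Key != "" && tol.Key != t.Key {
		return false
	}
	switch tol.Operator {
	case "", "Equal": // an unset operator is treated as Equal
		return tol.Value == t.Value
	case "Exists":
		return true
	}
	return false
}

// untolerated filters out every taint that some toleration matches.
func untolerated(taints []Taint, tols []Toleration) []Taint {
	var rest []Taint
	for _, t := range taints {
		matched := false
		for _, tol := range tols {
			if tolerates(tol, t) {
				matched = true
				break
			}
		}
		if !matched {
			rest = append(rest, t)
		}
	}
	return rest
}

// outcome summarises the scheduling decision for a pod that is not yet running on the node.
func outcome(taints []Taint, tols []Toleration) string {
	prefer := false
	for _, t := range untolerated(taints, tols) {
		switch t.Effect {
		case "NoSchedule", "NoExecute":
			return "rejected"
		case "PreferNoSchedule":
			prefer = true
		}
	}
	if prefer {
		return "avoided if possible"
	}
	return "allowed"
}

func main() {
	taints := []Taint{
		{"key1", "value1", "NoSchedule"},
		{"key1", "value1", "NoExecute"},
		{"key2", "value2", "NoSchedule"},
	}
	tols := []Toleration{
		{Key: "key1", Operator: "Equal", Value: "value1", Effect: "NoSchedule"},
		{Key: "key1", Operator: "Equal", Value: "value1", Effect: "NoExecute"},
	}
	// The key2 taint is not tolerated, so the pod cannot be scheduled here.
	fmt.Println(outcome(taints, tols))
}
```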

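2.2.3. Effect types

  • NoSchedule: only pods with a toleration matching this taint can be scheduled onto the node.

  • PreferNoSchedule: the system tries to avoid scheduling a pod onto a node whose taints it does not tolerate, but this is not enforced.

  • NoExecute: any pod that does not tolerate the taint is evicted immediately, and any pod that tolerates it is not evicted. A toleration may set tolerationSeconds, which is how long the pod may keep running on the node.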

    tolerations:
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoExecute"
      tolerationSeconds: 3600
    

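2.3. Use cases

2.3.1. Dedicated nodes

kubectl taint nodes <nodename> dedicated=<groupName>:NoSchedule

Taint the Node first, then add the corresponding toleration to the Pod. The Pod can then be scheduled onto the tainted Node, but it can still be scheduled onto other nodes as well.

To make the Pod run only on certain nodes, and make those nodes accept only such Pods, also label the Nodes (for example dedicated=groupName) and add the same label to the Pod's nodeSelector.

2.3.2. Nodes with special hardware

If some nodes have special hardware (for example GPUs), Pods that do not use that hardware should be kept off those Nodes so the resources stay available. Set a taint and a label on the Node, and the matching toleration and label selector on the Pod, so that these Nodes are used only by the intended Pods.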

# kubectl taint
kubectl taint nodes nodename special=true:NoSchedule 
# or
kubectl taint nodes nodename special=true:PreferNoSchedule

2.3.3. Taint-based eviction

A taint with effect NoExecute affects pods that are already running on the node; they are evicted according to the following rules:

  • A pod that does not tolerate a NoExecute taint is evicted immediately.
  • A pod that tolerates a NoExecute taint without specifying tolerationSeconds keeps running on the node indefinitely.
  • A pod that tolerates a NoExecute taint and specifies tolerationSeconds keeps running on the node for that length of time.

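8.3.2 - Safely migrating a node

1. Migrating Pods

1.1. Set whether nodes are schedulable

Decide which node is to be drained and which nodes will receive its Pods, then mark every node that must not receive migrated Pods as unschedulable.

# List nodes
kubectl get nodes

# Mark a node unschedulable
kubectl cordon <NodeName>

# Mark a node schedulable
kubectl uncordon <NodeName>

1.2. Run kubectl drain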

kubectl drain <NodeName> --force --ignore-daemonsets

Example:

$ kubectl drain bjzw-prek8sredis-99-40 --force --ignore-daemonsets
node "bjzw-prek8sredis-99-40" already cordoned
WARNING: Deleting pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: kube-proxy-bjzw-prek8sredis-99-40; Ignoring DaemonSet-managed pods: calicoopsmonitor-mfpqs, arachnia-agent-j56n8
pod "pre-test-pro2-r-0-redis-2-8-19-1" evicted
pod "pre-test-hwh1-r-8-redis-2-8-19-2" evicted
pod "pre-eos-hdfs-vector-eos-hdfs-redis-2-8-19-0" evicted

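1.3. Special notes

For Pods created by a StatefulSet, kubectl drain behaves as follows:

kubectl drain deletes the old Pod on the drained node, and a replacement Pod is started on a schedulable node. If the old Pod is not deleted cleanly, for example when it stays in the Terminating state, the new Pod will not start.

To resolve this, restart the kubelet on the affected node or force-delete the Pod.

Example:

# Restart the kubelet on the node where the Pod is stuck in `Terminating`
systemctl restart kubelet

# Force-delete the Pod in the `Terminating` state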
kubectl delete pod <PodName> --namespace=<Namespace> --force --grace-period=0

2. kubectl drain flow chart

3. TroubleShooting

1. Some Pods were not created by a ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (that is, static pods created from manifest files), so the --force flag is required.

$ kubectl drain bjzw-prek8sredis-99-40
node "bjzw-prek8sredis-99-40" already cordoned
error: unable to drain node "bjzw-prek8sredis-99-40", aborting command...

There are pending nodes to be drained:
 bjzw-prek8sredis-99-40
error: DaemonSet-managed pods (use --ignore-daemonsets to ignore): calicoopsmonitor-mfpqs, arachnia-agent-j56n8; pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet (use --force to override): kube-proxy-bjzw-prek8sredis-99-40

2. Some Pods are managed by a DaemonSet, so the --ignore-daemonsets flag is required to skip the error.

$ kubectl drain bjzw-prek8sredis-99-40 --force
node "bjzw-prek8sredis-99-40" already cordoned
error: unable to drain node "bjzw-prek8sredis-99-40", aborting command...

There are pending nodes to be drained:
 bjzw-prek8sredis-99-40
error: DaemonSet-managed pods (use --ignore-daemonsets to ignore): calicoopsmonitor-mfpqs, arachnia-agent-j56n8

4. kubectl drain

$ kubectl drain --help
Drain node in preparation for maintenance.

The given node will be marked unschedulable to prevent new pods from arriving. 'drain' evicts the pods if the APIServer
supports eviction (http://kubernetes.io/docs/admin/disruptions/). Otherwise, it will use normal DELETE to delete the
pods. The 'drain' evicts or deletes all pods except mirror pods (which cannot be deleted through the API server).  If
there are DaemonSet-managed pods, drain will not proceed without --ignore-daemonsets, and regardless it will not delete
any DaemonSet-managed pods, because those pods would be immediately replaced by the DaemonSet controller, which ignores
unschedulable markings.  If there are any pods that are neither mirror pods nor managed by ReplicationController,
ReplicaSet, DaemonSet, StatefulSet or Job, then drain will not delete any pods unless you use --force.  --force will
also allow deletion to proceed if the managing resource of one or more pods is missing.

'drain' waits for graceful termination. You should not operate on the machine until the command completes.

When you are ready to put the node back into service, use kubectl uncordon, which will make the node schedulable again.

! http://kubernetes.io/images/docs/kubectl_drain.svg

Examples:
  # Drain node "foo", even if there are pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or
StatefulSet on it.
  $ kubectl drain foo --force

  # As above, but abort if there are pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet or
StatefulSet, and use a grace period of 15 minutes.
  $ kubectl drain foo --grace-period=900

Options:
      --delete-local-data=false: Continue even if there are pods using emptyDir (local data that will be deleted when
the node is drained).
      --dry-run=false: If true, only print the object that would be sent, without sending it.
      --force=false: Continue even if there are pods not managed by a ReplicationController, ReplicaSet, Job, DaemonSet
or StatefulSet.
      --grace-period=-1: Period of time in seconds given to each pod to terminate gracefully. If negative, the default
value specified in the pod will be used.
      --ignore-daemonsets=false: Ignore DaemonSet-managed pods.
  -l, --selector='': Selector (label query) to filter on
      --timeout=0s: The length of time to wait before giving up, zero means infinite

Usage:
  kubectl drain NODE [options]

Use "kubectl options" for a list of global command-line options (applies to all commands).

8.4 - Image registry

8.4.1 - Configuring a private image registry

1. Basic registry operations

1.1. Log in to the registry

docker login -u <username> -p <password> <registry-addr>

1.2. Pull an image

docker pull registry.xxx.com/dev/nginx:latest

1.3. Push an image

docker push registry.xxx.com/dev/nginx:latest

1.4. Retag an image

docker tag <old-image> <new-image>

2. The docker.xxx.com registry

This section uses the docker.xxx.com registry.

2.1. Configure insecure-registries on all nodes

#cat /etc/docker/daemon.json
{
  "data-root": "/data/docker",
  "debug": false,
  "insecure-registries": [
	...
    "docker.xxx.com:8080"
  ],
  ...
}

2.2. Configure /var/lib/kubelet/config.json on all nodes

For details, see: configuring-nodes-to-authenticate-to-a-private-registry

  1. Log in to the docker.xxx.com:8080 registry on one node; this updates $HOME/.docker/config.json.
  2. Check that $HOME/.docker/config.json contains the auth entry for the registry.
#cat ~/.docker/config.json
{
	"auths": {
		"docker.xxx.com:8080": {
			"auth": "<credentials>"
		}
	},
	"HttpHeaders": {
		"User-Agent": "Docker-Client/18.09.9 (linux)"
	}
}
  3. Copy $HOME/.docker/config.json to /var/lib/kubelet/config.json on every Node.
# Get the IPs of all nodes
nodes=$(kubectl get nodes -o jsonpath='{range .items[*].status.addresses[?(@.type=="ExternalIP")]}{.address} {end}')
# Copy to all nodes
for n in $nodes; do scp ~/.docker/config.json root@$n:/var/lib/kubelet/config.json; done

2.3. Create a pod that uses an image from docker.xxx.com

Set the image to: docker.xxx.com:8080/public/2048:latest

Complete pod.yaml (a Deployment manifest):

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  generation: 1
  labels:
    k8s-app: dockeroa-hub
    qcloud-app: dockeroa-hub
  name: dockeroa-hub
  namespace: test
spec:
  progressDeadlineSeconds: 600
  replicas: 3
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s-app: dockeroa-hub
      qcloud-app: dockeroa-hub
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        k8s-app: dockeroa-hub
        qcloud-app: dockeroa-hub
    spec:
      containers:
      - image: docker.xxx.com:8080/public/2048:latest
        imagePullPolicy: Always
        name: game
        resources:
          limits:
            cpu: 500m
            memory: 1Gi
          requests:
            cpu: 250m
            memory: 256Mi
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      nodeName: 192.168.1.1
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Check the pod status

#kgpoowide -n game
NAME                                     READY   STATUS    RESTARTS   AGE     IP             NODE            NOMINATED NODE   READINESS GATES
docker-oa-757bbbddb5-h6j7m               1/1     Running   0          14m     192.168.2.51   192.168.1.1    <none>           <none>
docker-oa-757bbbddb5-jp5dw               1/1     Running   0          14m     192.168.1.32   192.168.1.2    <none>           <none>
docker-oa-757bbbddb5-nlw9f               1/1     Running   0          14m     192.168.0.43   192.168.1.3   <none>           <none>

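8.4.2 - Pulling private images

This article describes how to pull images from a private registry by setting imagePullSecrets on a pod.

1. Create the secret

A secret is namespace-scoped, so the namespace must be given when creating it.

kubectl create secret docker-registry <name> --docker-server=DOCKER_REGISTRY_SERVER --docker-username=DOCKER_USER --docker-password=DOCKER_PASSWORD -n <NAMESPACE>

2. Add imagePullSecrets to the serviceAccount

Adding imagePullSecrets to a serviceAccount makes the imagePullSecrets value be added automatically to pods that use it.

A serviceAccount is also namespace-scoped and only takes effect in its own namespace.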

#kubectl get secrets -n dev
NAME                  TYPE                                  DATA   AGE
docker.xxxx.com         kubernetes.io/dockerconfigjson        1      6h23m

Add the imagePullSecrets to the serviceAccount object.

The default serviceAccount object looks like this:

#kubectl get serviceaccount default -n dev -o yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  creationTimestamp: "2020-02-27T03:30:38Z"
  name: default
  namespace: dev
  resourceVersion: "11651567"
  selfLink: /api/v1/namespaces/dev/serviceaccounts/default
  uid: 85bcdd31-5911-11ea-9429-6c92bf3b7c33
secrets:
- name: default-token-s7wfn

Edit the serviceAccount and add the imagePullSecrets field.

imagePullSecrets:
- name: docker.xxxx.com

kubectl edit serviceaccount default -n dev

After the change, the content is:

apiVersion: v1
kind: ServiceAccount
metadata:
  creationTimestamp: "2020-02-27T03:30:38Z"
  name: default
  namespace: dev
  resourceVersion: "11651567"
  selfLink: /api/v1/namespaces/dev/serviceaccounts/default
  uid: 85bcdd31-5911-11ea-9429-6c92bf3b7c33
secrets:
- name: default-token-s7wfn
imagePullSecrets:
- name: docker.xxxx.com

3. Create a pod with imagePullSecrets

If step 2 has been done and the imagePullSecrets are on the serviceAccount, the pod does not need to specify imagePullSecrets; they are added automatically.

Otherwise, set imagePullSecrets in the pod to reference the registry secret created earlier.

spec:
  imagePullSecrets:
  - name: docker.xxxx.com

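4. Notes

Secret and serviceAccount objects are namespace-scoped, so both must be created and updated again in each namespace. This suits the case where different users have separate registry credentials: each user gets a dedicated secret and uses it to pull and deploy that user's own images.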

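9 - Development guide

9.1 - Using client-go and analysing its source code

1. Introduction to client-go

1.1 About client-go

client-go is a client for the resource-object APIs of a Kubernetes cluster: it is used to create, delete, update and query resource objects in the cluster (deployment, service, ingress, replicaSet, pod, namespace, node and so on). Most secondary development that wraps the Kubernetes API behind its own front end is built on the client-go package.

client-go official documentation: https://github.com/kubernetes/client-go

1.2 Sample code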

git clone https://github.com/huweihuang/client-go.git
cd client-go
# Make sure a kubeconfig file for the Kubernetes cluster exists under the local HOME directory
go run client-go.go

client-go.go

package main

import (
	"flag"
	"fmt"
	"os"
	"path/filepath"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	var kubeconfig *string
	if home := homeDir(); home != "" {
		kubeconfig = flag.String("kubeconfig", filepath.Join(home, ".kube", "config"), "(optional) absolute path to the kubeconfig file")
	} else {
		kubeconfig = flag.String("kubeconfig", "", "absolute path to the kubeconfig file")
	}
	flag.Parse()
	// uses the current context in kubeconfig
	config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
	if err != nil {
		panic(err.Error())
	}
	// creates the clientset
	clientset, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err.Error())
	}
	for {
		pods, err := clientset.CoreV1().Pods("").List(metav1.ListOptions{})
		if err != nil {
			panic(err.Error())
		}
		fmt.Printf("There are %d pods in the cluster\n", len(pods.Items))
		time.Sleep(10 * time.Second)
	}
}

func homeDir() string {
	if h := os.Getenv("HOME"); h != "" {
		return h
	}
	return os.Getenv("USERPROFILE") // windows
}

1.3 Output

➜ go run client-go.go
There are 9 pods in the cluster
There are 7 pods in the cluster
There are 7 pods in the cluster
There are 7 pods in the cluster
There are 7 pods in the cluster

2. client-go source code analysis

client-go source: https://github.com/kubernetes/client-go

client-go source directory layout

  • The kubernetes package contains the clientset to access Kubernetes API.
  • The discovery package is used to discover APIs supported by a Kubernetes API server.
  • The dynamic package contains a dynamic client that can perform generic operations on arbitrary Kubernetes API objects.
  • The transport package is used to set up auth and start a connection.
  • The tools/cache package is useful for writing controllers.

2.1 kubeconfig

kubeconfig = flag.String("kubeconfig", filepath.Join(home, ".kube", "config"), "(optional) absolute path to the kubeconfig file")

This obtains the absolute path of the kubeconfig file, normally $HOME/.kube/config. The file configures which Kubernetes cluster the local client connects to.

The config content looks like this:

apiVersion: v1
clusters:
- cluster:
    server: http://<kube-master-ip>:8080
  name: k8s
contexts:
- context:
    cluster: k8s
    namespace: default
    user: ""
  name: default
current-context: default
kind: Config
preferences: {}
users: []

2.2 rest.config

A rest.Config object is obtained from BuildConfigFromFlags using either the master URL or the kubeconfig path; normally the kubeconfig path is used.

config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)

Source of BuildConfigFromFlags:

k8s.io/client-go/tools/clientcmd/client_config.go

// BuildConfigFromFlags is a helper function that builds configs from a master
// url or a kubeconfig filepath. These are passed in as command line flags for cluster
// components. Warnings should reflect this usage. If neither masterUrl or kubeconfigPath
// are passed in we fallback to inClusterConfig. If inClusterConfig fails, we fallback
// to the default config.
func BuildConfigFromFlags(masterUrl, kubeconfigPath string) (*restclient.Config, error) {
	if kubeconfigPath == "" && masterUrl == "" {
		glog.Warningf("Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.")
		kubeconfig, err := restclient.InClusterConfig()
		if err == nil {
			return kubeconfig, nil
		}
		glog.Warning("error creating inClusterConfig, falling back to default config: ", err)
	}
	return NewNonInteractiveDeferredLoadingClientConfig(
		&ClientConfigLoadingRules{ExplicitPath: kubeconfigPath},
		&ConfigOverrides{ClusterInfo: clientcmdapi.Cluster{Server: masterUrl}}).ClientConfig()
}

2.3 clientset

A clientset is obtained from NewForConfig with a *rest.Config argument. A clientset is a collection of clients, and each client may expose the method calls of a different API version.

clientset, err := kubernetes.NewForConfig(config)

2.3.1 NewForConfig

NewForConfig initialises every client in the clientset.

k8s.io/client-go/kubernetes/clientset.go

// NewForConfig creates a new Clientset for the given config.
func NewForConfig(c *rest.Config) (*Clientset, error) {
	configShallowCopy := *c
	...
	var cs Clientset
	cs.appsV1beta1, err = appsv1beta1.NewForConfig(&configShallowCopy)
	...
	cs.coreV1, err = corev1.NewForConfig(&configShallowCopy)
	...
}

2.3.2 The Clientset struct

k8s.io/client-go/kubernetes/clientset.go

// Clientset contains the clients for groups. Each group has exactly one
// version included in a Clientset.
type Clientset struct {
	*discovery.DiscoveryClient
	admissionregistrationV1alpha1 *admissionregistrationv1alpha1.AdmissionregistrationV1alpha1Client
	appsV1beta1                   *appsv1beta1.AppsV1beta1Client
	appsV1beta2                   *appsv1beta2.AppsV1beta2Client
	authenticationV1              *authenticationv1.AuthenticationV1Client
	authenticationV1beta1         *authenticationv1beta1.AuthenticationV1beta1Client
	authorizationV1               *authorizationv1.AuthorizationV1Client
	authorizationV1beta1          *authorizationv1beta1.AuthorizationV1beta1Client
	autoscalingV1                 *autoscalingv1.AutoscalingV1Client
	autoscalingV2beta1            *autoscalingv2beta1.AutoscalingV2beta1Client
	batchV1                       *batchv1.BatchV1Client
	batchV1beta1                  *batchv1beta1.BatchV1beta1Client
	batchV2alpha1                 *batchv2alpha1.BatchV2alpha1Client
	certificatesV1beta1           *certificatesv1beta1.CertificatesV1beta1Client
	coreV1                        *corev1.CoreV1Client
	extensionsV1beta1             *extensionsv1beta1.ExtensionsV1beta1Client
	networkingV1                  *networkingv1.NetworkingV1Client
	policyV1beta1                 *policyv1beta1.PolicyV1beta1Client
	rbacV1                        *rbacv1.RbacV1Client
	rbacV1beta1                   *rbacv1beta1.RbacV1beta1Client
	rbacV1alpha1                  *rbacv1alpha1.RbacV1alpha1Client
	schedulingV1alpha1            *schedulingv1alpha1.SchedulingV1alpha1Client
	settingsV1alpha1              *settingsv1alpha1.SettingsV1alpha1Client
	storageV1beta1                *storagev1beta1.StorageV1beta1Client
	storageV1                     *storagev1.StorageV1Client
}

2.3.3 clientset.Interface

Clientset implements the following Interface, so a specific client is obtained by calling one of its methods. For example:

pods, err := clientset.CoreV1().Pods("").List(metav1.ListOptions{})

The method-set interface of the clientset:

k8s.io/client-go/kubernetes/clientset.go

type Interface interface {
	Discovery() discovery.DiscoveryInterface
	AdmissionregistrationV1alpha1() admissionregistrationv1alpha1.AdmissionregistrationV1alpha1Interface
	// Deprecated: please explicitly pick a version if possible.
	Admissionregistration() admissionregistrationv1alpha1.AdmissionregistrationV1alpha1Interface
	AppsV1beta1() appsv1beta1.AppsV1beta1Interface
	AppsV1beta2() appsv1beta2.AppsV1beta2Interface
	// Deprecated: please explicitly pick a version if possible.
	Apps() appsv1beta2.AppsV1beta2Interface
	AuthenticationV1() authenticationv1.AuthenticationV1Interface
	// Deprecated: please explicitly pick a version if possible.
	Authentication() authenticationv1.AuthenticationV1Interface
	AuthenticationV1beta1() authenticationv1beta1.AuthenticationV1beta1Interface
	AuthorizationV1() authorizationv1.AuthorizationV1Interface
	// Deprecated: please explicitly pick a version if possible.
	Authorization() authorizationv1.AuthorizationV1Interface
	AuthorizationV1beta1() authorizationv1beta1.AuthorizationV1beta1Interface
	AutoscalingV1() autoscalingv1.AutoscalingV1Interface
	// Deprecated: please explicitly pick a version if possible.
	Autoscaling() autoscalingv1.AutoscalingV1Interface
	AutoscalingV2beta1() autoscalingv2beta1.AutoscalingV2beta1Interface
	BatchV1() batchv1.BatchV1Interface
	// Deprecated: please explicitly pick a version if possible.
	Batch() batchv1.BatchV1Interface
	BatchV1beta1() batchv1beta1.BatchV1beta1Interface
	BatchV2alpha1() batchv2alpha1.BatchV2alpha1Interface
	CertificatesV1beta1() certificatesv1beta1.CertificatesV1beta1Interface
	// Deprecated: please explicitly pick a version if possible.
	Certificates() certificatesv1beta1.CertificatesV1beta1Interface
	CoreV1() corev1.CoreV1Interface
	// Deprecated: please explicitly pick a version if possible.
	Core() corev1.CoreV1Interface
	ExtensionsV1beta1() extensionsv1beta1.ExtensionsV1beta1Interface
	// Deprecated: please explicitly pick a version if possible.
	Extensions() extensionsv1beta1.ExtensionsV1beta1Interface
	NetworkingV1() networkingv1.NetworkingV1Interface
	// Deprecated: please explicitly pick a version if possible.
	Networking() networkingv1.NetworkingV1Interface
	PolicyV1beta1() policyv1beta1.PolicyV1beta1Interface
	// Deprecated: please explicitly pick a version if possible.
	Policy() policyv1beta1.PolicyV1beta1Interface
	RbacV1() rbacv1.RbacV1Interface
	// Deprecated: please explicitly pick a version if possible.
	Rbac() rbacv1.RbacV1Interface
	RbacV1beta1() rbacv1beta1.RbacV1beta1Interface
	RbacV1alpha1() rbacv1alpha1.RbacV1alpha1Interface
	SchedulingV1alpha1() schedulingv1alpha1.SchedulingV1alpha1Interface
	// Deprecated: please explicitly pick a version if possible.
	Scheduling() schedulingv1alpha1.SchedulingV1alpha1Interface
	SettingsV1alpha1() settingsv1alpha1.SettingsV1alpha1Interface
	// Deprecated: please explicitly pick a version if possible.
	Settings() settingsv1alpha1.SettingsV1alpha1Interface
	StorageV1beta1() storagev1beta1.StorageV1beta1Interface
	StorageV1() storagev1.StorageV1Interface
	// Deprecated: please explicitly pick a version if possible.
	Storage() storagev1.StorageV1Interface
}

2.4 CoreV1Client

CoreV1Client from the clientset is taken as the example here.

The CoreV1Client object is initialised from the rest.Config that is passed in.

k8s.io/client-go/kubernetes/clientset.go

cs.coreV1, err = corev1.NewForConfig(&configShallowCopy)

2.4.1 corev1.NewForConfig

k8s.io/client-go/kubernetes/typed/core/v1/core_client.go

// NewForConfig creates a new CoreV1Client for the given config.
func NewForConfig(c *rest.Config) (*CoreV1Client, error) {
	config := *c
	if err := setConfigDefaults(&config); err != nil {
		return nil, err
	}
	client, err := rest.RESTClientFor(&config)
	if err != nil {
		return nil, err
	}
	return &CoreV1Client{client}, nil
}

corev1.NewForConfig essentially calls rest.RESTClientFor(&config) to create a RESTClient, so a CoreV1Client is in essence a thin wrapper around a RESTClient.

2.4.2 The CoreV1Client struct

The CoreV1Client struct is defined as follows:

k8s.io/client-go/kubernetes/typed/core/v1/core_client.go

// CoreV1Client is used to interact with features provided by the  group.
type CoreV1Client struct {
	restClient rest.Interface
}

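CoreV1Client implements the CoreV1Interface interface through the methods below, which provide create, delete, update and query operations on Kubernetes resource objects.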

k8s.io/client-go/kubernetes/typed/core/v1/core_client.go

// Methods of CoreV1Client
func (c *CoreV1Client) ComponentStatuses() ComponentStatusInterface {...}
//ConfigMaps
func (c *CoreV1Client) ConfigMaps(namespace string) ConfigMapInterface {...}
//Endpoints
func (c *CoreV1Client) Endpoints(namespace string) EndpointsInterface {...}
func (c *CoreV1Client) Events(namespace string) EventInterface {...}
func (c *CoreV1Client) LimitRanges(namespace string) LimitRangeInterface {...}
//Namespaces
func (c *CoreV1Client) Namespaces() NamespaceInterface {...}
//Nodes
func (c *CoreV1Client) Nodes() NodeInterface {...}
func (c *CoreV1Client) PersistentVolumes() PersistentVolumeInterface {...}
func (c *CoreV1Client) PersistentVolumeClaims(namespace string) PersistentVolumeClaimInterface {...}
//Pods
func (c *CoreV1Client) Pods(namespace string) PodInterface {...}
func (c *CoreV1Client) PodTemplates(namespace string) PodTemplateInterface {...}
//ReplicationControllers
func (c *CoreV1Client) ReplicationControllers(namespace string) ReplicationControllerInterface {...}
func (c *CoreV1Client) ResourceQuotas(namespace string) ResourceQuotaInterface {...}
func (c *CoreV1Client) Secrets(namespace string) SecretInterface {...}
//Services
func (c *CoreV1Client) Services(namespace string) ServiceInterface {...}
func (c *CoreV1Client) ServiceAccounts(namespace string) ServiceAccountInterface {...}

2.4.3 CoreV1Interface

k8s.io/client-go/kubernetes/typed/core/v1/core_client.go

type CoreV1Interface interface {
	RESTClient() rest.Interface
	ComponentStatusesGetter
	ConfigMapsGetter
	EndpointsGetter
	EventsGetter
	LimitRangesGetter
	NamespacesGetter
	NodesGetter
	PersistentVolumesGetter
	PersistentVolumeClaimsGetter
	PodsGetter
	PodTemplatesGetter
	ReplicationControllersGetter
	ResourceQuotasGetter
	SecretsGetter
	ServicesGetter
	ServiceAccountsGetter
}

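CoreV1Interface embeds the getter interfaces for the various Kubernetes objects: for example, PodsGetter gives access to the create, delete, update and query operations on pods, and ServicesGetter does the same for services.

2.4.4 PodsGetter

The PodsGetter interface is used below to trace how CoreV1Client performs create, delete, update and query calls on pod objects.

The code from the sample is: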

pods, err := clientset.CoreV1().Pods("").List(metav1.ListOptions{})

CoreV1().Pods()

k8s.io/client-go/kubernetes/typed/core/v1/core_client.go

func (c *CoreV1Client) Pods(namespace string) PodInterface {
	return newPods(c, namespace)
}

newPods()

k8s.io/client-go/kubernetes/typed/core/v1/pod.go

// newPods returns a Pods
func newPods(c *CoreV1Client, namespace string) *pods {
	return &pods{
		client: c.RESTClient(),
		ns:     namespace,
	}
}

CoreV1().Pods() calls newPods(), which creates a pods object. The pods object holds a rest.Interface client, so the implementation ultimately comes down to HTTP calls made by RESTClient.

k8s.io/client-go/kubernetes/typed/core/v1/pod.go

// pods implements PodInterface
type pods struct {
	client rest.Interface
	ns     string
}

The pods object implements the PodInterface interface, which defines the create, delete, update and query methods for pods.

k8s.io/client-go/kubernetes/typed/core/v1/pod.go

// PodInterface has methods to work with Pod resources.
type PodInterface interface {
	Create(*v1.Pod) (*v1.Pod, error)
	Update(*v1.Pod) (*v1.Pod, error)
	UpdateStatus(*v1.Pod) (*v1.Pod, error)
	Delete(name string, options *meta_v1.DeleteOptions) error
	DeleteCollection(options *meta_v1.DeleteOptions, listOptions meta_v1.ListOptions) error
	Get(name string, options meta_v1.GetOptions) (*v1.Pod, error)
	List(opts meta_v1.ListOptions) (*v1.PodList, error)
	Watch(opts meta_v1.ListOptions) (watch.Interface, error)
	Patch(name string, pt types.PatchType, data []byte, subresources ...string) (result *v1.Pod, err error)
	PodExpansion
}

PodsGetter

PodsGetter declares a single method, Pods, which returns a PodInterface.

k8s.io/client-go/kubernetes/typed/core/v1/pod.go

// PodsGetter has a method to return a PodInterface.
// A group's client should implement this interface.
type PodsGetter interface {
	Pods(namespace string) PodInterface
}

Pods().List()

pods.List() retrieves pod resources from Kubernetes through an HTTP call made by RESTClient.

k8s.io/client-go/kubernetes/typed/core/v1/pod.go

// List takes label and field selectors, and returns the list of Pods that match those selectors.
func (c *pods) List(opts meta_v1.ListOptions) (result *v1.PodList, err error) {
	result = &v1.PodList{}
	err = c.client.Get().
		Namespace(c.ns).
		Resource("pods").
		VersionedParams(&opts, scheme.ParameterCodec).
		Do().
		Into(result)
	return
}

The above traces how clientset.CoreV1().Pods("").List(metav1.ListOptions{}) retrieves pod resources; in the end the work is done by RESTClient methods.
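The chained call in List() is a builder that accumulates the parts of a request before it is sent. The sketch below imitates that pattern with standard-library code only; request and restClient are hypothetical, much-reduced stand-ins and not the client-go types. It shows how the namespace and resource end up in the URL path, with an empty namespace meaning all namespaces.

```go
package main

import (
	"fmt"
	"path"
)

// request is a reduced stand-in for a REST request under construction.
type request struct {
	verb      string
	apiPath   string
	namespace string
	resource  string
}

// Namespace and Resource record one attribute each and return the request
// so that calls can be chained.
func (r *request) Namespace(ns string) *request { r.namespace = ns; return r }
func (r *request) Resource(res string) *request { r.resource = res; return r }

// URLPath assembles the path; the namespace segment is added only when set.
func (r *request) URLPath() string {
	p := r.apiPath
	if r.namespace != "" {
		p = path.Join(p, "namespaces", r.namespace)
	}
	return path.Join(p, r.resource)
}

// restClient is a reduced stand-in for the client that creates requests.
type restClient struct{ apiPath string }

func (c *restClient) Get() *request { return &request{verb: "GET", apiPath: c.apiPath} }

func main() {
	c := &restClient{apiPath: "/api/v1"}

	all := c.Get().Namespace("").Resource("pods")
	fmt.Println(all.verb, all.URLPath())

	one := c.Get().Namespace("kube-system").Resource("pods")
	fmt.Println(one.verb, one.URLPath())
}
```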

2.5 RESTClient

This section covers how a RESTClient is created and what it does.

Creating a RESTClient likewise depends on the config that is passed in.

k8s.io/client-go/kubernetes/typed/core/v1/core_client.go

client, err := rest.RESTClientFor(&config)

2.5.1 rest.RESTClientFor

k8s.io/client-go/rest/config.go

// RESTClientFor returns a RESTClient that satisfies the requested attributes on a client Config
// object. Note that a RESTClient may require fields that are optional when initializing a Client.
// A RESTClient created by this method is generic - it expects to operate on an API that follows
// the Kubernetes conventions, but may not be the Kubernetes API.
func RESTClientFor(config *Config) (*RESTClient, error) {
	...
	qps := config.QPS
	...
	burst := config.Burst
	...
	baseURL, versionedAPIPath, err := defaultServerUrlFor(config)
	...
	transport, err := TransportFor(config)
	...
	var httpClient *http.Client
	if transport != http.DefaultTransport {
		httpClient = &http.Client{Transport: transport}
		if config.Timeout > 0 {
			httpClient.Timeout = config.Timeout
		}
	}

	return NewRESTClient(baseURL, versionedAPIPath, config.ContentConfig, qps, burst, config.RateLimiter, httpClient)
}

RESTClientFor calls the NewRESTClient constructor.

2.5.2 NewRESTClient

k8s.io/client-go/rest/client.go

// NewRESTClient creates a new RESTClient. This client performs generic REST functions
// such as Get, Put, Post, and Delete on specified paths.  Codec controls encoding and
// decoding of responses from the server.
func NewRESTClient(baseURL *url.URL, versionedAPIPath string, config ContentConfig, maxQPS float32, maxBurst int, rateLimiter flowcontrol.RateLimiter, client *http.Client) (*RESTClient, error) {
	base := *baseURL
	...
	serializers, err := createSerializers(config)
	...
	return &RESTClient{
		base:             &base,
		versionedAPIPath: versionedAPIPath,
		contentConfig:    config,
		serializers:      *serializers,
		createBackoffMgr: readExpBackoffConfig,
		Throttle:         throttle,
		Client:           client,
	}, nil
}

2.5.3 The RESTClient struct

The RESTClient struct is defined below. It contains an http.Client, so in essence RESTClient is a wrapper around http.Client.

k8s.io/client-go/rest/client.go

// RESTClient imposes common Kubernetes API conventions on a set of resource paths.
// The baseURL is expected to point to an HTTP or HTTPS path that is the parent
// of one or more resources.  The server should return a decodable API resource
// object, or an api.Status object which contains information about the reason for
// any failure.
//
// Most consumers should use client.New() to get a Kubernetes API client.
type RESTClient struct {
	// base is the root URL for all invocations of the client
	base *url.URL
	// versionedAPIPath is a path segment connecting the base URL to the resource root
	versionedAPIPath string

	// contentConfig is the information used to communicate with the server.
	contentConfig ContentConfig

	// serializers contain all serializers for underlying content type.
	serializers Serializers

	// creates BackoffManager that is passed to requests.
	createBackoffMgr func() BackoffManager

	// TODO extract this into a wrapper interface via the RESTClient interface in kubectl.
	Throttle flowcontrol.RateLimiter

	// Set specific behavior of the client.  If not set http.DefaultClient will be used.
	Client *http.Client
}

2.5.4 RESTClient.Interface

RESTClient implements the following interface:

k8s.io/client-go/rest/client.go

// Interface captures the set of operations for generically interacting with Kubernetes REST apis.
type Interface interface {
	GetRateLimiter() flowcontrol.RateLimiter
	Verb(verb string) *Request
	Post() *Request
	Put() *Request
	Patch(pt types.PatchType) *Request
	Get() *Request
	Delete() *Request
	APIVersion() schema.GroupVersion
}

Calling one of the HTTP methods (Post(), Put(), Get(), Delete()) actually calls Verb(verb string).

k8s.io/client-go/rest/client.go

// Verb begins a request with a verb (GET, POST, PUT, DELETE).
//
// Example usage of RESTClient's request building interface:
// c, err := NewRESTClient(...)
// if err != nil { ... }
// resp, err := c.Verb("GET").
//  Path("pods").
//  SelectorParam("labels", "area=staging").
//  Timeout(10*time.Second).
//  Do()
// if err != nil { ... }
// list, ok := resp.(*api.PodList)
//
func (c *RESTClient) Verb(verb string) *Request {
	backoff := c.createBackoffMgr()

	if c.Client == nil {
		return NewRequest(nil, verb, c.base, c.versionedAPIPath, c.contentConfig, c.serializers, backoff, c.Throttle)
	}
	return NewRequest(c.Client, verb, c.base, c.versionedAPIPath, c.contentConfig, c.serializers, backoff, c.Throttle)
}

Verb calls NewRequest to build the request; calling Do() at the end of the chain sends the HTTP request and returns a Result.
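The builder chain can be pictured with a much smaller stand-in. The sketch below is not client-go code: the request type and the newRequest and podsURL helpers are hypothetical names that only mimic how Namespace(), Resource() and the final Do() accumulate a URL before one HTTP call is made through net/http. A local httptest server stands in for the API server.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
	"path"
)

// request is a hypothetical, heavily simplified stand-in for rest.Request.
type request struct {
	client    *http.Client
	verb      string
	base      *url.URL
	apiPath   string
	namespace string
	resource  string
}

// newRequest plays the role of NewRequest: it only records its inputs.
func newRequest(c *http.Client, verb string, base *url.URL, apiPath string) *request {
	if c == nil {
		c = http.DefaultClient // fallback when no client was configured
	}
	return &request{client: c, verb: verb, base: base, apiPath: apiPath}
}

// Namespace and Resource return the receiver so that calls can be chained.
func (r *request) Namespace(ns string) *request { r.namespace = ns; return r }
func (r *request) Resource(res string) *request { r.resource = res; return r }

// URL assembles <base>/<apiPath>[/namespaces/<ns>]/<resource>.
func (r *request) URL() string {
	u := *r.base
	p := path.Join("/", u.Path, r.apiPath)
	if r.namespace != "" {
		p = path.Join(p, "namespaces", r.namespace)
	}
	u.Path = path.Join(p, r.resource)
	return u.String()
}

// Do is the only step that touches the network.
func (r *request) Do() (string, error) {
	req, err := http.NewRequest(r.verb, r.URL(), nil)
	if err != nil {
		return "", err
	}
	resp, err := r.client.Do(req)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

// podsURL builds the list-pods URL for a namespace without sending anything.
func podsURL(rawBase, ns string) (string, error) {
	base, err := url.Parse(rawBase)
	if err != nil {
		return "", err
	}
	return newRequest(nil, "GET", base, "/api/v1").Namespace(ns).Resource("pods").URL(), nil
}

func main() {
	// A local test server stands in for the API server and echoes what it received.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "%s %s", r.Method, r.URL.Path)
	}))
	defer srv.Close()

	base, err := url.Parse(srv.URL)
	if err != nil {
		panic(err)
	}
	out, err := newRequest(srv.Client(), "GET", base, "/api/v1").Namespace("default").Resource("pods").Do()
	fmt.Println(out, err)
}
```

Nothing is sent until Do() runs; every earlier call in the chain only mutates the request value, which is why the chained calls can appear in any order before Do().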

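2.6 Summary

Before client-go can call Kubernetes resource objects, it needs the Kubernetes configuration, i.e. $HOME/.kube/config.

The whole call chain is as follows:

kubeconfig→rest.Config→clientset→specific client (CoreV1Client)→specific resource object (pod)→RESTClient→http.Client→HTTP request and response

Create, delete, update and query operations on Kubernetes resource objects go through the different clients in the clientset and the per-resource methods of each client. Commonly used clients include CoreV1Client, AppsV1beta1Client and ExtensionsV1beta1Client.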

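3. Calling k8s resources with client-go

Creating the clientset

// get the kubeconfig path; home comes from homedir.HomeDir() in k8s.io/client-go/util/homedir
home := homedir.HomeDir()
kubeconfig := flag.String("kubeconfig", filepath.Join(home, ".kube", "config"), "(optional) absolute path to the kubeconfig file")
flag.Parse()
// build the rest.Config
config, err := clientcmd.BuildConfigFromFlags("", *kubeconfig)
// create the clientset
clientset, err := kubernetes.NewForConfig(config)
// see the examples below for the individual resource calls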

3.1 deployment

// declare the deployment object
var deployment *v1beta1.Deployment
// populate the deployment object (omitted here)
// create the deployment
deployment, err := clientset.AppsV1beta1().Deployments(<namespace>).Create(<deployment>)
// update the deployment
deployment, err := clientset.AppsV1beta1().Deployments(<namespace>).Update(<deployment>)
// delete the deployment
err := clientset.AppsV1beta1().Deployments(<namespace>).Delete(<deployment.Name>, &meta_v1.DeleteOptions{})
// get the deployment
deployment, err := clientset.AppsV1beta1().Deployments(<namespace>).Get(<deployment.Name>, meta_v1.GetOptions{})
// list deployments (List takes ListOptions by value in this client-go version)
deploymentList, err := clientset.AppsV1beta1().Deployments(<namespace>).List(meta_v1.ListOptions{})
// watch deployments
watchInterface, err := clientset.AppsV1beta1().Deployments(<namespace>).Watch(meta_v1.ListOptions{})

3.2 service

// declare the service object
var service *v1.Service
// populate the service object (omitted here)
// create the service
service, err := clientset.CoreV1().Services(<namespace>).Create(<service>)
// update the service
service, err := clientset.CoreV1().Services(<namespace>).Update(<service>)
// delete the service
err := clientset.CoreV1().Services(<namespace>).Delete(<service.Name>, &meta_v1.DeleteOptions{})
// get the service
service, err := clientset.CoreV1().Services(<namespace>).Get(<service.Name>, meta_v1.GetOptions{})
// list services
serviceList, err := clientset.CoreV1().Services(<namespace>).List(meta_v1.ListOptions{})
// watch services
watchInterface, err := clientset.CoreV1().Services(<namespace>).Watch(meta_v1.ListOptions{})

3.3 ingress

// declare the ingress object
var ingress *v1beta1.Ingress
// populate the ingress object (omitted here)
// create the ingress
ingress, err := clientset.ExtensionsV1beta1().Ingresses(<namespace>).Create(<ingress>)
// update the ingress
ingress, err := clientset.ExtensionsV1beta1().Ingresses(<namespace>).Update(<ingress>)
// delete the ingress
err := clientset.ExtensionsV1beta1().Ingresses(<namespace>).Delete(<ingress.Name>, &meta_v1.DeleteOptions{})
// get the ingress
ingress, err := clientset.ExtensionsV1beta1().Ingresses(<namespace>).Get(<ingress.Name>, meta_v1.GetOptions{})
// list ingresses
ingressList, err := clientset.ExtensionsV1beta1().Ingresses(<namespace>).List(meta_v1.ListOptions{})
// watch ingresses
watchInterface, err := clientset.ExtensionsV1beta1().Ingresses(<namespace>).Watch(meta_v1.ListOptions{})

3.4 replicaSet

// declare the replicaSet object
var replicaSet *v1beta1.ReplicaSet
// populate the replicaSet object (omitted here)
// create the replicaSet
replicaSet, err := clientset.ExtensionsV1beta1().ReplicaSets(<namespace>).Create(<replicaSet>)
// update the replicaSet
replicaSet, err := clientset.ExtensionsV1beta1().ReplicaSets(<namespace>).Update(<replicaSet>)
// delete the replicaSet
err := clientset.ExtensionsV1beta1().ReplicaSets(<namespace>).Delete(<replicaSet.Name>, &meta_v1.DeleteOptions{})
// get the replicaSet
replicaSet, err := clientset.ExtensionsV1beta1().ReplicaSets(<namespace>).Get(<replicaSet.Name>, meta_v1.GetOptions{})
// list replicaSets
replicaSetList, err := clientset.ExtensionsV1beta1().ReplicaSets(<namespace>).List(meta_v1.ListOptions{})
// watch replicaSets
watchInterface, err := clientset.ExtensionsV1beta1().ReplicaSets(<namespace>).Watch(meta_v1.ListOptions{})

In recent versions of Kubernetes, a replicaSet is normally created through a deployment, and the replicaSet in turn controls the pods.

3.5 pod

// declare the pod object
var pod *v1.Pod
// create the pod
pod, err := clientset.CoreV1().Pods(<namespace>).Create(<pod>)
// update the pod
pod, err := clientset.CoreV1().Pods(<namespace>).Update(<pod>)
// delete the pod
err := clientset.CoreV1().Pods(<namespace>).Delete(<pod.Name>, &meta_v1.DeleteOptions{})
// get the pod
pod, err := clientset.CoreV1().Pods(<namespace>).Get(<pod.Name>, meta_v1.GetOptions{})
// list pods
podList, err := clientset.CoreV1().Pods(<namespace>).List(meta_v1.ListOptions{})
// watch pods
watchInterface, err := clientset.CoreV1().Pods(<namespace>).Watch(meta_v1.ListOptions{})

3.6 statefulset

// declare the statefulset object (v1 here refers to the apps/v1 API types)
var statefulset *v1.StatefulSet
// create the statefulset
statefulset, err := clientset.AppsV1().StatefulSets(<namespace>).Create(<statefulset>)
// update the statefulset
statefulset, err := clientset.AppsV1().StatefulSets(<namespace>).Update(<statefulset>)
// delete the statefulset
err := clientset.AppsV1().StatefulSets(<namespace>).Delete(<statefulset.Name>, &meta_v1.DeleteOptions{})
// get the statefulset
statefulset, err := clientset.AppsV1().StatefulSets(<namespace>).Get(<statefulset.Name>, meta_v1.GetOptions{})
// list statefulsets
statefulsetList, err := clientset.AppsV1().StatefulSets(<namespace>).List(meta_v1.ListOptions{})
// watch statefulsets
watchInterface, err := clientset.AppsV1().StatefulSets(<namespace>).Watch(meta_v1.ListOptions{})

As the operations above show, every resource object has create, delete, update and query methods, and the calling logic is largely the same. Secondary development usually only needs to create three resource objects: deployment, service and ingress. Pod objects are created and deleted by the replicaSet that the deployment owns. The function arguments are generally just NAMESPACE and kubernetesObject, and some operations also take an Options argument. Before creating a resource, you have to populate the object's data, which is comparable to editing the resource's yaml file and then running kubectl create -f xxx.yaml.
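To make the "populate the object, then create it" step concrete, here is a minimal sketch that fills in a service-like object in Go and serializes it. The objectMeta, servicePort, serviceSpec and service types are hypothetical, trimmed-down stand-ins written for this example; real code would use the k8s.io/api types and pass the object to Create instead of marshaling it by hand.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// The types below are hypothetical stand-ins for the real API types;
// they keep only the fields this example fills in.
type objectMeta struct {
	Name      string `json:"name"`
	Namespace string `json:"namespace,omitempty"`
}

type servicePort struct {
	Port int32 `json:"port"`
}

type serviceSpec struct {
	Selector map[string]string `json:"selector,omitempty"`
	Ports    []servicePort     `json:"ports"`
}

type service struct {
	APIVersion string      `json:"apiVersion"`
	Kind       string      `json:"kind"`
	Metadata   objectMeta  `json:"metadata"`
	Spec       serviceSpec `json:"spec"`
}

// newService fills in the object, which is the step that corresponds to
// writing the yaml file by hand.
func newService(namespace, name string, port int32) service {
	return service{
		APIVersion: "v1",
		Kind:       "Service",
		Metadata:   objectMeta{Name: name, Namespace: namespace},
		Spec: serviceSpec{
			Selector: map[string]string{"app": name},
			Ports:    []servicePort{{Port: port}},
		},
	}
}

// manifest serializes the object, roughly as it would travel in a request body.
func manifest(s service) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

func main() {
	out, err := manifest(newService("demo", "web", 80))
	fmt.Println(out, err)
}
```

The struct literal in newService carries the same information a yaml manifest would, which is why the two approaches are interchangeable from the API server's point of view.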


9.2 - Operator development

9.2.1 - Using kubebuilder

1. kubebuilder

1.1. Installing kubebuilder

# download kubebuilder and install locally.
curl -L -o kubebuilder https://go.kubebuilder.io/dl/latest/$(go env GOOS)/$(go env GOARCH)
chmod +x kubebuilder && mv kubebuilder /usr/local/bin/

1.2. kubebuilder commands

Development kit for building Kubernetes extensions and tools.

Provides libraries and tools to create new projects, APIs and controllers.
Includes tools for packaging artifacts into an installer container.

Typical project lifecycle:

- initialize a project:

  kubebuilder init --domain example.com --license apache2 --owner "The Kubernetes authors"

- create one or more a new resource APIs and add your code to them:

  kubebuilder create api --group <group> --version <version> --kind <Kind>

Create resource will prompt the user for if it should scaffold the Resource and / or Controller. To only
scaffold a Controller for an existing Resource, select "n" for Resource. To only define
the schema for a Resource without writing a Controller, select "n" for Controller.

After the scaffold is written, api will run make on the project.

Usage:
  kubebuilder [command]

Available Commands:
  create      Scaffold a Kubernetes API or webhook.
  edit        This command will edit the project configuration
  help        Help about any command
  init        Initialize a new project
  version     Print the kubebuilder version

Flags:
  -h, --help   help for kubebuilder

Use "kubebuilder [command] --help" for more information about a command.

2. Steps

2.1. Initialization

mkdir -p $GOPATH/src/github.com/huweihuang/operator-example
cd $GOPATH/src/github.com/huweihuang/operator-example

go mod init github.com/huweihuang/operator-example

2.2. Creating the project

# kubebuilder init --domain github.com --license apache2 --owner "Hu Weihuang"
Writing scaffold for you to edit...
Get controller runtime:
$ go get sigs.k8s.io/controller-runtime@v0.5.0
Update go.mod:
$ go mod tidy
Running make:
$ make
go: creating new go.mod: module tmp
go: finding sigs.k8s.io v0.2.5
go: finding sigs.k8s.io/controller-tools/cmd v0.2.5
go: finding sigs.k8s.io/controller-tools/cmd/controller-gen v0.2.5
/Users/weihuanghu/go/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
go fmt ./...
go vet ./...
go build -o bin/manager main.go
Next: define a resource with:
$ kubebuilder create api

Generated files:

./
├── Dockerfile
├── Makefile
├── PROJECT
├── bin
│   └── manager
├── config
│   ├── certmanager
│   │   ├── certificate.yaml
│   │   ├── kustomization.yaml
│   │   └── kustomizeconfig.yaml
│   ├── default
│   │   ├── kustomization.yaml
│   │   ├── manager_auth_proxy_patch.yaml
│   │   ├── manager_webhook_patch.yaml
│   │   └── webhookcainjection_patch.yaml
│   ├── manager
│   │   ├── kustomization.yaml
│   │   └── manager.yaml
│   ├── prometheus
│   │   ├── kustomization.yaml
│   │   └── monitor.yaml
│   ├── rbac
│   │   ├── auth_proxy_client_clusterrole.yaml
│   │   ├── auth_proxy_role.yaml
│   │   ├── auth_proxy_role_binding.yaml
│   │   ├── auth_proxy_service.yaml
│   │   ├── kustomization.yaml
│   │   ├── leader_election_role.yaml
│   │   ├── leader_election_role_binding.yaml
│   │   └── role_binding.yaml
│   └── webhook
│       ├── kustomization.yaml
│       ├── kustomizeconfig.yaml
│       └── service.yaml
├── go.mod
├── go.sum
├── hack
│   └── boilerplate.go.txt
└── main.go

2.3. Creating the API

# kubebuilder create api --group webapp --version v1 --kind Guestbook
Create Resource [y/n]
y
Create Controller [y/n]
y
Writing scaffold for you to edit...
api/v1/guestbook_types.go
controllers/guestbook_controller.go
Running make:
$ make
go: creating new go.mod: module tmp
go: finding sigs.k8s.io/controller-tools/cmd v0.2.5
go: finding sigs.k8s.io/controller-tools/cmd/controller-gen v0.2.5
go: finding sigs.k8s.io v0.2.5
/Users/weihuanghu/go/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
go fmt ./...
go vet ./...
go build -o bin/manager main.go

Files created:

api
└── v1
    ├── groupversion_info.go
    ├── guestbook_types.go
    └── zz_generated.deepcopy.go
controllers
├── guestbook_controller.go
└── suite_test.go

Contents of api/v1/guestbook_types.go:

// GuestbookSpec defines the desired state of Guestbook
type GuestbookSpec struct {
    // INSERT ADDITIONAL SPEC FIELDS - desired state of cluster
    // Important: Run "make" to regenerate code after modifying this file

    // Quantity of instances
    // +kubebuilder:validation:Minimum=1
    // +kubebuilder:validation:Maximum=10
    Size int32 `json:"size"`

    // Name of the ConfigMap for GuestbookSpec's configuration
    // +kubebuilder:validation:MaxLength=15
    // +kubebuilder:validation:MinLength=1
    ConfigMapName string `json:"configMapName"`

    // +kubebuilder:validation:Enum=Phone;Address;Name
    Type string `json:"alias,omitempty"`
}

// GuestbookStatus defines the observed state of Guestbook
type GuestbookStatus struct {
    // INSERT ADDITIONAL STATUS FIELD - define observed state of cluster
    // Important: Run "make" to regenerate code after modifying this file

    // PodName of the active Guestbook node.
    Active string `json:"active"`

    // PodNames of the standby Guestbook nodes.
    Standby []string `json:"standby"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:resource:scope=Cluster

// Guestbook is the Schema for the guestbooks API
type Guestbook struct {
    metav1.TypeMeta   `json:",inline"`
    metav1.ObjectMeta `json:"metadata,omitempty"`

    Spec   GuestbookSpec   `json:"spec,omitempty"`
    Status GuestbookStatus `json:"status,omitempty"`
}
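The +kubebuilder:validation markers above are turned into schema rules in the generated CRD, so the checks run in the API server rather than in Go. To make the constraints easier to read, the sketch below spells them out as ordinary Go. The validateSpec function is hypothetical and is not something kubebuilder generates; treating an empty Type as valid is an assumption based on the field being tagged omitempty.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// GuestbookSpec repeats the three fields from the type above, without tags.
type GuestbookSpec struct {
	Size          int32
	ConfigMapName string
	Type          string
}

// validateSpec is a hand-written mirror of the validation markers.
func validateSpec(s GuestbookSpec) error {
	// Minimum=1, Maximum=10
	if s.Size < 1 || s.Size > 10 {
		return fmt.Errorf("size must be between 1 and 10, got %d", s.Size)
	}
	// MinLength=1, MaxLength=15 (counted in characters, not bytes)
	if n := utf8.RuneCountInString(s.ConfigMapName); n < 1 || n > 15 {
		return fmt.Errorf("configMapName must be 1 to 15 characters, got %d", n)
	}
	// Enum=Phone;Address;Name
	switch s.Type {
	case "", "Phone", "Address", "Name":
		// "" is accepted on the assumption that an omitted optional field skips the enum check.
	default:
		return fmt.Errorf("alias must be one of Phone, Address, Name, got %q", s.Type)
	}
	return nil
}

func main() {
	fmt.Println(validateSpec(GuestbookSpec{Size: 3, ConfigMapName: "guestbook-cfg", Type: "Phone"}))
	fmt.Println(validateSpec(GuestbookSpec{Size: 11, ConfigMapName: "guestbook-cfg"}))
}
```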

3. troubleshooting

3.1. controller-gen: No such file or directory

➜  operator-example kubebuilder init --domain github.com --license apache2 --owner "Hu Weihuang"
Writing scaffold for you to edit...
Get controller runtime:
$ go get sigs.k8s.io/controller-runtime@v0.5.0
Update go.mod:
$ go mod tidy
Running make:
$ make
go: creating new go.mod: module tmp
go: finding sigs.k8s.io v0.2.5
go: finding sigs.k8s.io/controller-tools/cmd v0.2.5
go: finding sigs.k8s.io/controller-tools/cmd/controller-gen v0.2.5
/Users/weihuanghu/go:/Users/weihuanghu/k8spath/bin/controller-gen object:headerFile="hack/boilerplate.go.txt" paths="./..."
/bin/sh: /Users/weihuanghu/go:/Users/weihuanghu/k8spath/bin/controller-gen: No such file or directory
make: *** [generate] Error 127
2020/04/13 14:34:47 failed to initialize project: exit status 2

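This happens when GOPATH lists several directories. As the log shows, the whole colon-separated GOPATH value was used as a single path prefix for bin/controller-gen, which yields a path that does not exist. Exporting GOPATH so that it contains only the directory that holds the current project resolves the error.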

export GOPATH="/path/to/gopath"
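The failure mode can be reproduced in a few lines. The sketch below, which assumes a Unix-style ":" list separator, splits a multi-entry GOPATH with filepath.SplitList and prints one candidate controller-gen path per entry, in contrast to the single invalid path seen in the log. The controllerGenCandidates helper is a hypothetical name used only for this illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// controllerGenCandidates returns one bin/controller-gen path per GOPATH entry.
func controllerGenCandidates(gopath string) []string {
	var out []string
	for _, dir := range filepath.SplitList(gopath) {
		out = append(out, filepath.Join(dir, "bin", "controller-gen"))
	}
	return out
}

func main() {
	// With GOPATH=/Users/me/go:/Users/me/k8spath this prints two separate paths.
	for _, p := range controllerGenCandidates(os.Getenv("GOPATH")) {
		fmt.Println(p)
	}
}
```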


9.3 - CSI plugin development

9.3.1 - csi-provisioner source code analysis

This article analyzes the csi-provisioner source code. For how to develop a Dynamic Provisioner, see the nfs-client-provisioner source code analysis.

1. Dynamic Provisioner

1.1. Provisioner Interface

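Developing a Dynamic Provisioner requires implementing the Provisioner interface, which has two methods:

  • Provision: creates the storage resource and returns a PV object.
  • Delete: removes the corresponding storage resource, but does not delete the PV object.

1.2. Steps for developing a provisioner

  1. Write a provisioner that implements the Provisioner interface (the Provision and Delete methods).
  2. Build a ProvisionController from that provisioner.
  3. Call the ProvisionController's Run method.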
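The division of labor between a provisioner and its controller can be sketched without any Kubernetes dependency. Everything below is hypothetical and simplified: PV stands in for v1.PersistentVolume, the interface methods take a plain name instead of VolumeOptions, and the controller type only imitates the role of ProvisionController by keeping PV objects in a map.

```go
package main

import (
	"errors"
	"fmt"
)

// PV is a hypothetical stand-in for v1.PersistentVolume.
type PV struct {
	Name   string
	Handle string // identifier of the backing storage asset
}

// Provisioner mirrors the two-method shape of the real interface,
// with simplified argument types.
type Provisioner interface {
	Provision(pvName string) (*PV, error)
	Delete(pv *PV) error
}

// memProvisioner keeps "storage assets" in a map instead of a real backend.
type memProvisioner struct{ assets map[string]bool }

func (m *memProvisioner) Provision(pvName string) (*PV, error) {
	handle := "vol-" + pvName
	m.assets[handle] = true
	return &PV{Name: pvName, Handle: handle}, nil
}

func (m *memProvisioner) Delete(pv *PV) error {
	if !m.assets[pv.Handle] {
		return errors.New("unknown volume " + pv.Handle)
	}
	delete(m.assets, pv.Handle)
	return nil
}

// controller imitates ProvisionController: it owns the PV objects and
// delegates the storage work to the Provisioner.
type controller struct {
	p   Provisioner
	pvs map[string]*PV // stand-in for PV objects stored in the API server
}

func newController(p Provisioner) *controller {
	return &controller{p: p, pvs: map[string]*PV{}}
}

func (c *controller) provisionClaim(pvName string) error {
	pv, err := c.p.Provision(pvName)
	if err != nil {
		return err
	}
	c.pvs[pv.Name] = pv // the controller, not the provisioner, saves the PV object
	return nil
}

func (c *controller) deleteVolume(pvName string) error {
	pv, ok := c.pvs[pvName]
	if !ok {
		return errors.New("no such PV " + pvName)
	}
	if err := c.p.Delete(pv); err != nil {
		return err
	}
	delete(c.pvs, pvName) // the PV object goes away only after storage cleanup
	return nil
}

func main() {
	backend := &memProvisioner{assets: map[string]bool{}}
	c := newController(backend)

	err := c.provisionClaim("pvc-123")
	fmt.Println(err, len(c.pvs), len(backend.assets))

	err = c.deleteVolume("pvc-123")
	fmt.Println(err, len(c.pvs), len(backend.assets))
}
```

Keeping the PV bookkeeping in the controller is what lets a new storage backend be supported by writing only the two Provisioner methods.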

2. CSI Provisioner

For the CSI Provisioner source code, see https://github.com/kubernetes-csi/external-provisioner.

2.1. The main function

2.1.1. Parsing command-line flags

The source is as follows:

var (
	provisioner          = flag.String("provisioner", "", "Name of the provisioner. The provisioner will only provision volumes for claims that request a StorageClass with a provisioner field set equal to this name.")
	master               = flag.String("master", "", "Master URL to build a client config from. Either this or kubeconfig needs to be set if the provisioner is being run out of cluster.")
	kubeconfig           = flag.String("kubeconfig", "", "Absolute path to the kubeconfig file. Either this or master needs to be set if the provisioner is being run out of cluster.")
	csiEndpoint          = flag.String("csi-address", "/run/csi/socket", "The gRPC endpoint for Target CSI Volume")
	connectionTimeout    = flag.Duration("connection-timeout", 10*time.Second, "Timeout for waiting for CSI driver socket.")
	volumeNamePrefix     = flag.String("volume-name-prefix", "pvc", "Prefix to apply to the name of a created volume")
	volumeNameUUIDLength = flag.Int("volume-name-uuid-length", -1, "Truncates generated UUID of a created volume to this length. Defaults behavior is to NOT truncate.")
	showVersion          = flag.Bool("version", false, "Show version.")

	provisionController *controller.ProvisionController
	version             = "unknown"
)

func init() {
	var config *rest.Config
	var err error

	flag.Parse()
	flag.Set("logtostderr", "true")

	if *showVersion {
		fmt.Println(os.Args[0], version)
		os.Exit(0)
	}
	glog.Infof("Version: %s", version)
	...	
}	

The init function parses the flags. Among them, provisioner is the name of the provisioner that supplies PVs for PVCs; it must match the provisioner field of the StorageClass object.
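A reduced version of this flag handling, using only the standard flag package, might look like the sketch below. The options struct and the parseArgs and handles helpers are hypothetical names for this example; only three of the flags are reproduced, with the defaults shown in the source above.

```go
package main

import (
	"flag"
	"fmt"
	"time"
)

// options holds the subset of flags this sketch cares about.
type options struct {
	provisioner string
	csiAddress  string
	timeout     time.Duration
}

// parseArgs parses args into options, using a private FlagSet so that it
// can be called more than once.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("csi-provisioner", flag.ContinueOnError)
	fs.StringVar(&o.provisioner, "provisioner", "", "name of the provisioner")
	fs.StringVar(&o.csiAddress, "csi-address", "/run/csi/socket", "gRPC endpoint of the CSI driver")
	fs.DurationVar(&o.timeout, "connection-timeout", 10*time.Second, "timeout for the CSI driver socket")
	err := fs.Parse(args)
	return o, err
}

// handles reports whether a StorageClass with the given provisioner field
// would be served by a provisioner started with these options.
func handles(o options, storageClassProvisioner string) bool {
	return o.provisioner != "" && o.provisioner == storageClassProvisioner
}

func main() {
	o, err := parseArgs([]string{"-provisioner=csi.example.com", "-connection-timeout=30s"})
	fmt.Println(o, err)
	fmt.Println(handles(o, "csi.example.com"), handles(o, "other.example.com"))
}
```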

2.1.2. Obtaining the clientset object

The source is as follows:

// get the KUBECONFIG from env if specified (useful for local/debug cluster)
kubeconfigEnv := os.Getenv("KUBECONFIG")
if kubeconfigEnv != "" {
	glog.Infof("Found KUBECONFIG environment variable set, using that..")
	kubeconfig = &kubeconfigEnv
}
if *master != "" || *kubeconfig != "" {
	glog.Infof("Either master or kubeconfig specified. building kube config from that..")
	config, err = clientcmd.BuildConfigFromFlags(*master, *kubeconfig)
} else {
	glog.Infof("Building kube configs for running in cluster...")
	config, err = rest.InClusterConfig()
}
if err != nil {
	glog.Fatalf("Failed to create config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
	glog.Fatalf("Failed to create client: %v", err)
}

// snapclientset.NewForConfig creates a new Clientset for VolumesnapshotV1alpha1Client
snapClient, err := snapclientset.NewForConfig(config)
if err != nil {
	glog.Fatalf("Failed to create snapshot client: %v", err)
}
csiAPIClient, err := csiclientset.NewForConfig(config)
if err != nil {
	glog.Fatalf("Failed to create CSI API client: %v", err)
}

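The k8s configuration is read and used to create the clientset object, which is then used to call the k8s API, mainly to create and delete objects such as PVs and PVCs.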
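The precedence between the KUBECONFIG environment variable, the two flags and the in-cluster fallback is easy to miss in the code above. The following sketch isolates just that decision as a pure function; configSource is a hypothetical helper and returns a description instead of building a real rest.Config.

```go
package main

import "fmt"

// configSource reports which configuration path the startup logic would take.
// The three arguments stand for the -master flag, the -kubeconfig flag and the
// KUBECONFIG environment variable.
func configSource(master, kubeconfig, kubeconfigEnv string) string {
	if kubeconfigEnv != "" {
		kubeconfig = kubeconfigEnv // the environment variable wins over the flag
	}
	if master != "" || kubeconfig != "" {
		return fmt.Sprintf("out-of-cluster(master=%q, kubeconfig=%q)", master, kubeconfig)
	}
	return "in-cluster"
}

func main() {
	fmt.Println(configSource("", "", ""))
	fmt.Println(configSource("", "/etc/flag.conf", "/home/me/.kube/config"))
	fmt.Println(configSource("https://10.0.0.1:6443", "", ""))
}
```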

2.1.3. Checking the k8s version

// The controller needs to know what the server version is because out-of-tree
// provisioners aren't officially supported until 1.5
serverVersion, err := clientset.Discovery().ServerVersion()
if err != nil {
	glog.Fatalf("Error getting server version: %v", err)
}

The k8s server version is fetched because out-of-tree provisioners are only supported from k8s 1.5 onwards.

2.1.4. Connecting to the CSI socket

// Generate a unique ID for this provisioner
timeStamp := time.Now().UnixNano() / int64(time.Millisecond)
identity := strconv.FormatInt(timeStamp, 10) + "-" + strconv.Itoa(rand.Intn(10000)) + "-" + *provisioner

// Provisioner will stay in Init until driver opens csi socket, once it's done
// controller will exit this loop and proceed normally.
socketDown := true
grpcClient := &grpc.ClientConn{}
for socketDown {
	grpcClient, err = ctrl.Connect(*csiEndpoint, *connectionTimeout)
	if err == nil {
		socketDown = false
		continue
	}
	time.Sleep(10 * time.Second)
}

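The provisioner stays in its initialization state until the connection to the CSI socket succeeds. If the connection fails, it waits 10 seconds and retries. Two parameters are involved:

  • csiEndpoint: the gRPC address of the CSI volume driver, /run/csi/socket by default.
  • connectionTimeout: the timeout for connecting to the CSI driver socket, 10 seconds by default.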
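The wait-and-retry pattern can be sketched with a pluggable dial function so that it runs without a real socket. The connectUntilReady and flakyDialer helpers are hypothetical, and the maxAttempts bound is an addition made for this sketch: the loop in the source retries forever.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// connectUntilReady calls dial until it succeeds, sleeping between attempts.
// It returns the number of attempts made. maxAttempts is an addition for this
// sketch; the loop in the source never gives up.
func connectUntilReady(dial func() error, interval time.Duration, maxAttempts int) (int, error) {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = dial(); err == nil {
			return attempt, nil
		}
		if attempt < maxAttempts {
			time.Sleep(interval)
		}
	}
	return maxAttempts, fmt.Errorf("socket still down after %d attempts: %w", maxAttempts, err)
}

// flakyDialer returns a dial function that fails its first n calls.
func flakyDialer(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("connection refused")
		}
		return nil
	}
}

func main() {
	attempts, err := connectUntilReady(flakyDialer(2), 10*time.Millisecond, 5)
	fmt.Println(attempts, err)
}
```

In a real deployment the dial function would wrap the gRPC connection attempt, and the interval would be the 10 seconds used in the source.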

2.1.5. Constructing the csi-provisioner object

// Create the provisioner: it implements the Provisioner interface expected by
// the controller
csiProvisioner := ctrl.NewCSIProvisioner(clientset, csiAPIClient, *csiEndpoint, *connectionTimeout, identity, *volumeNamePrefix, *volumeNameUUIDLength, grpcClient, snapClient)
provisionController = controller.NewProvisionController(
	clientset,
	*provisioner,
	csiProvisioner,
	serverVersion.GitVersion,
)

The csi-provisioner object is constructed from the arguments clientset, csiAPIClient, csiEndpoint, connectionTimeout, identity, volumeNamePrefix, volumeNameUUIDLength, grpcClient and snapClient.

The ProvisionController object is then constructed from csiProvisioner.

2.1.6. Running the ProvisionController

func main() {
	provisionController.Run(wait.NeverStop)
}

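ProvisionController implements the actual PV and PVC logic; its Run method runs as a long-lived process.

2.2. The Provision and Delete methods

2.2.1. The Provision method

For the source of csiProvisioner's Provision method, see https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/controller.go#L336

The Provision method creates the storage resource and returns a PV object. Its argument is a VolumeOptions value, which specifies the attributes of the PV object.

1. Building the PV attributes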

pvName, err := makeVolumeName(p.volumeNamePrefix, fmt.Sprintf("%s", options.PVC.ObjectMeta.UID), p.volumeNameUUIDLength)
if err != nil {
	return nil, err
}

2. Building the CSIPersistentVolumeSource attributes

driverState, err := checkDriverState(p.grpcClient, p.timeout, needSnapshotSupport)
if err != nil {
	return nil, err
}

...
// Resolve controller publish, node stage, node publish secret references
controllerPublishSecretRef, err := getSecretReference(controllerPublishSecretNameKey, controllerPublishSecretNamespaceKey, options.Parameters, pvName, options.PVC)
if err != nil {
	return nil, err
}
nodeStageSecretRef, err := getSecretReference(nodeStageSecretNameKey, nodeStageSecretNamespaceKey, options.Parameters, pvName, options.PVC)
if err != nil {
	return nil, err
}
nodePublishSecretRef, err := getSecretReference(nodePublishSecretNameKey, nodePublishSecretNamespaceKey, options.Parameters, pvName, options.PVC)
if err != nil {
	return nil, err
}

...
volumeAttributes := map[string]string{provisionerIDKey: p.identity}
for k, v := range rep.Volume.Attributes {
	volumeAttributes[k] = v
}

...
fsType := ""
for k, v := range options.Parameters {
	switch strings.ToLower(k) {
	case "fstype":
		fsType = v
	}
}
if len(fsType) == 0 {
	fsType = defaultFSType
}
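The fsType lookup at the end of this fragment is a case-insensitive search of the StorageClass parameters with a fallback. A standalone version might look like the sketch below; fsTypeFrom is a hypothetical helper, and "ext4" is passed in as an assumed fallback because the value of defaultFSType is not shown in the excerpt.

```go
package main

import (
	"fmt"
	"strings"
)

// fsTypeFrom returns the value of the "fstype" parameter, matched without
// regard to case, or fallback when the parameter is absent or empty.
// If several keys differ only in case, which one wins is not defined,
// because Go map iteration order is random.
func fsTypeFrom(params map[string]string, fallback string) string {
	fsType := ""
	for k, v := range params {
		if strings.ToLower(k) == "fstype" {
			fsType = v
		}
	}
	if fsType == "" {
		return fallback
	}
	return fsType
}

func main() {
	fmt.Println(fsTypeFrom(map[string]string{"fsType": "xfs"}, "ext4"))
	fmt.Println(fsTypeFrom(map[string]string{"replicas": "3"}, "ext4"))
}
```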

3. Creating the CSI CreateVolumeRequest

// Create a CSI CreateVolumeRequest and Response
req := csi.CreateVolumeRequest{
	Name:               pvName,
	Parameters:         options.Parameters,
	VolumeCapabilities: volumeCaps,
	CapacityRange: &csi.CapacityRange{
		RequiredBytes: int64(volSizeBytes),
	},
}
...
glog.V(5).Infof("CreateVolumeRequest %+v", req)

rep := &csi.CreateVolumeResponse{}
...
opts := wait.Backoff{Duration: backoffDuration, Factor: backoffFactor, Steps: backoffSteps}
err = wait.ExponentialBackoff(opts, func() (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
	defer cancel()
	rep, err = p.csiClient.CreateVolume(ctx, &req)
	if err == nil {
		// CreateVolume has finished successfully
		return true, nil
	}

	if status, ok := status.FromError(err); ok {
		if status.Code() == codes.DeadlineExceeded {
			// CreateVolume timed out, give it another chance to complete
			glog.Warningf("CreateVolume timeout: %s has expired, operation will be retried", p.timeout.String())
			return false, nil
		}
	}
	// CreateVolume failed , no reason to retry, bailing from ExponentialBackoff
	return false, err
})

if err != nil {
	return nil, err
}

if rep.Volume != nil {
	glog.V(3).Infof("create volume rep: %+v", *rep.Volume)
}

respCap := rep.GetVolume().GetCapacityBytes()
if respCap < volSizeBytes {
	capErr := fmt.Errorf("created volume capacity %v less than requested capacity %v", respCap, volSizeBytes)
	delReq := &csi.DeleteVolumeRequest{
		VolumeId: rep.GetVolume().GetId(),
	}
	delReq.ControllerDeleteSecrets = provisionerCredentials
	ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
	defer cancel()
	_, err := p.csiClient.DeleteVolume(ctx, delReq)
	if err != nil {
		capErr = fmt.Errorf("%v. Cleanup of volume %s failed, volume is orphaned: %v", capErr, pvName, err)
	}
	return nil, capErr
}

The core of the Provision method is the call to p.csiClient.CreateVolume(ctx, &req).
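The retry policy wrapped around CreateVolume (retry with growing delays on a timeout, stop at once on any other error) can be sketched with the standard library alone. The retryOnTimeout function below is a hypothetical stand-in for wait.ExponentialBackoff, and context.DeadlineExceeded stands in for the gRPC DeadlineExceeded status code checked in the source.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errGaveUp is returned when every attempt timed out.
var errGaveUp = errors.New("all attempts timed out")

// retryOnTimeout runs op up to steps times. A timeout triggers another attempt
// after a delay that is multiplied by factor each round; any other error is
// returned immediately.
func retryOnTimeout(initial time.Duration, factor float64, steps int, op func() error) error {
	delay := initial
	for i := 0; i < steps; i++ {
		err := op()
		if err == nil {
			return nil
		}
		if !errors.Is(err, context.DeadlineExceeded) {
			return err // not a timeout, so retrying would not help
		}
		if i < steps-1 {
			time.Sleep(delay)
			delay = time.Duration(float64(delay) * factor)
		}
	}
	return errGaveUp
}

// timeoutsThenOK returns an operation that times out n times and then
// succeeds, together with a counter of how often it was called.
func timeoutsThenOK(n int) (func() error, *int) {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return context.DeadlineExceeded
		}
		return nil
	}, &calls
}

// failWith returns an operation that always fails with a non-timeout error.
func failWith(msg string) (func() error, *int) {
	calls := 0
	return func() error {
		calls++
		return errors.New(msg)
	}, &calls
}

func main() {
	op, calls := timeoutsThenOK(2)
	err := retryOnTimeout(time.Millisecond, 2, 5, op)
	fmt.Println(err, *calls)

	op, calls = failWith("quota exceeded")
	err = retryOnTimeout(time.Millisecond, 2, 5, op)
	fmt.Println(err, *calls)
}
```

Retrying only on timeouts matters here because a volume creation that timed out may still complete on the backend, whereas other errors indicate that the request itself was rejected.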

4. Building the PV object

pv := &v1.PersistentVolume{
	ObjectMeta: metav1.ObjectMeta{
		Name: pvName,
	},
	Spec: v1.PersistentVolumeSpec{
		PersistentVolumeReclaimPolicy: options.PersistentVolumeReclaimPolicy,
		AccessModes:                   options.PVC.Spec.AccessModes,
		Capacity: v1.ResourceList{
			v1.ResourceName(v1.ResourceStorage): bytesToGiQuantity(respCap),
		},
		// TODO wait for CSI VolumeSource API
		PersistentVolumeSource: v1.PersistentVolumeSource{
			CSI: &v1.CSIPersistentVolumeSource{
				Driver:                     driverState.driverName,
				VolumeHandle:               p.volumeIdToHandle(rep.Volume.Id),
				FSType:                     fsType,
				VolumeAttributes:           volumeAttributes,
				ControllerPublishSecretRef: controllerPublishSecretRef,
				NodeStageSecretRef:         nodeStageSecretRef,
				NodePublishSecretRef:       nodePublishSecretRef,
			},
		},
	},
}

if driverState.capabilities.Has(PluginCapability_ACCESSIBILITY_CONSTRAINTS) {
	pv.Spec.NodeAffinity = GenerateVolumeNodeAffinity(rep.Volume.AccessibleTopology)
}

glog.Infof("successfully created PV %+v", pv.Spec.PersistentVolumeSource)

return pv, nil

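Provision only builds the PV object from the VolumeOptions argument and the created volume; it does not create or delete the PV object itself.

Different kinds of provisioner generally differ in the type and parameters of the PersistentVolumeSource. For csi-provisioner, the PersistentVolumeSource is CSI, and the following CSI-related parameters must be supplied: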

  • Driver
  • VolumeHandle
  • FSType
  • VolumeAttributes
  • ControllerPublishSecretRef
  • NodeStageSecretRef
  • NodePublishSecretRef

2.2.2. The Delete method

For the source of csiProvisioner's Delete method, see https://github.com/kubernetes-csi/external-provisioner/blob/master/pkg/controller/controller.go#L606

func (p *csiProvisioner) Delete(volume *v1.PersistentVolume) error {
	if volume == nil || volume.Spec.CSI == nil {
		return fmt.Errorf("invalid CSI PV")
	}
	volumeId := p.volumeHandleToId(volume.Spec.CSI.VolumeHandle)

	_, err := checkDriverState(p.grpcClient, p.timeout, false)
	if err != nil {
		return err
	}

	req := csi.DeleteVolumeRequest{
		VolumeId: volumeId,
	}
	// get secrets if StorageClass specifies it
	storageClassName := volume.Spec.StorageClassName
	if len(storageClassName) != 0 {
		if storageClass, err := p.client.StorageV1().StorageClasses().Get(storageClassName, metav1.GetOptions{}); err == nil {
			// Resolve provision secret credentials.
			// No PVC is provided when resolving provision/delete secret names, since the PVC may or may not exist at delete time.
			provisionerSecretRef, err := getSecretReference(provisionerSecretNameKey, provisionerSecretNamespaceKey, storageClass.Parameters, volume.Name, nil)
			if err != nil {
				return err
			}
			credentials, err := getCredentials(p.client, provisionerSecretRef)
			if err != nil {
				return err
			}
			req.ControllerDeleteSecrets = credentials
		}

	}
	ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
	defer cancel()

	_, err = p.csiClient.DeleteVolume(ctx, &req)

	return err
}

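The Delete method mainly calls p.csiClient.DeleteVolume(ctx, &req).

2.3. Summary

csi provisioner implements the Provisioner interface, which consists of the Provision and Delete methods:

  • Provision: calls csiClient.CreateVolume, then builds and returns the PV object.
  • Delete: calls csiClient.DeleteVolume.

The core methods of csi provisioner both delegate to csi-client methods.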

3. csi-client

For the csi client code, see https://github.com/container-storage-interface/spec/blob/master/lib/go/csi/v0/csi.pb.go

3.1. Constructing the csi-client

3.1.1. Constructing the grpcClient

// Provisioner will stay in Init until driver opens csi socket, once it's done
// controller will exit this loop and proceed normally.
socketDown := true
grpcClient := &grpc.ClientConn{}
for socketDown {
	grpcClient, err = ctrl.Connect(*csiEndpoint, *connectionTimeout)
	if err == nil {
		socketDown = false
		continue
	}
	time.Sleep(10 * time.Second)
}

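The loop keeps connecting to the CSI socket; a usable grpcClient is available only once the connection succeeds.

3.1.2. Constructing the csi-client

The csi-client is then built from the grpcClient.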

// Create the provisioner: it implements the Provisioner interface expected by
// the controller
csiProvisioner := ctrl.NewCSIProvisioner(clientset, csiAPIClient, *csiEndpoint, *connectionTimeout, identity, *volumeNamePrefix, *volumeNameUUIDLength, grpcClient, snapClient)

NewCSIProvisioner

// NewCSIProvisioner creates new CSI provisioner
func NewCSIProvisioner(client kubernetes.Interface,
	csiAPIClient csiclientset.Interface,
	csiEndpoint string,
	connectionTimeout time.Duration,
	identity string,
	volumeNamePrefix string,
	volumeNameUUIDLength int,
	grpcClient *grpc.ClientConn,
	snapshotClient snapclientset.Interface) controller.Provisioner {

	csiClient := csi.NewControllerClient(grpcClient)
	provisioner := &csiProvisioner{
		client:               client,
		grpcClient:           grpcClient,
		csiClient:            csiClient,
		csiAPIClient:         csiAPIClient,
		snapshotClient:       snapshotClient,
		timeout:              connectionTimeout,
		identity:             identity,
		volumeNamePrefix:     volumeNamePrefix,
		volumeNameUUIDLength: volumeNameUUIDLength,
	}
	return provisioner
}

NewControllerClient

csiClient := csi.NewControllerClient(grpcClient)
...
type controllerClient struct {
	cc *grpc.ClientConn
}

func NewControllerClient(cc *grpc.ClientConn) ControllerClient {
	return &controllerClient{cc}
}

3.2. csiClient.CreateVolume

The csi provisioner calls csiClient.CreateVolume as follows:

opts := wait.Backoff{Duration: backoffDuration, Factor: backoffFactor, Steps: backoffSteps}
err = wait.ExponentialBackoff(opts, func() (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
	defer cancel()
	rep, err = p.csiClient.CreateVolume(ctx, &req)
	if err == nil {
		// CreateVolume has finished successfully
		return true, nil
	}

	if status, ok := status.FromError(err); ok {
		if status.Code() == codes.DeadlineExceeded {
			// CreateVolume timed out, give it another chance to complete
			glog.Warningf("CreateVolume timeout: %s has expired, operation will be retried", p.timeout.String())
			return false, nil
		}
	}
	// CreateVolume failed , no reason to retry, bailing from ExponentialBackoff
	return false, err
})

Construction of the CreateVolumeRequest:

// Create a CSI CreateVolumeRequest and Response
req := csi.CreateVolumeRequest{
	Name:               pvName,
	Parameters:         options.Parameters,
	VolumeCapabilities: volumeCaps,
	CapacityRange: &csi.CapacityRange{
		RequiredBytes: int64(volSizeBytes),
	},
}
...
req.VolumeContentSource = volumeContentSource
...
req.AccessibilityRequirements = requirements
...
req.ControllerCreateSecrets = provisionerCredentials

The concrete Create implementation is shown below.

csiClient is an interface type.

For the concrete code, see controllerClient.CreateVolume:

func (c *controllerClient) CreateVolume(ctx context.Context, in *CreateVolumeRequest, opts ...grpc.CallOption) (*CreateVolumeResponse, error) {
	out := new(CreateVolumeResponse)
	err := grpc.Invoke(ctx, "/csi.v0.Controller/CreateVolume", in, out, c.cc, opts...)
	if err != nil {
		return nil, err
	}
	return out, nil
}

3.3. csiClient.DeleteVolume

The csi provisioner calls csiClient.DeleteVolume as follows:

func (p *csiProvisioner) Delete(volume *v1.PersistentVolume) error {
	...
	req := csi.DeleteVolumeRequest{
		VolumeId: volumeId,
	}
	// get secrets if StorageClass specifies it
	...
    
	ctx, cancel := context.WithTimeout(context.Background(), p.timeout)
	defer cancel()

	_, err = p.csiClient.DeleteVolume(ctx, &req)

	return err
}

Construction of the DeleteVolumeRequest:

req := csi.DeleteVolumeRequest{
	VolumeId: volumeId,
}
...
req.ControllerDeleteSecrets = credentials

The constructed DeleteVolumeRequest is passed to the DeleteVolume method.

The concrete Delete implementation is shown below.

For the concrete code, see controllerClient.DeleteVolume:

func (c *controllerClient) DeleteVolume(ctx context.Context, in *DeleteVolumeRequest, opts ...grpc.CallOption) (*DeleteVolumeResponse, error) {
	out := new(DeleteVolumeResponse)
	err := grpc.Invoke(ctx, "/csi.v0.Controller/DeleteVolume", in, out, c.cc, opts...)
	if err != nil {
		return nil, err
	}
	return out, nil
}

4. ProvisionController.Run

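A custom provisioner implements the Provision and Delete methods of the Provisioner interface. These two methods create and delete the backend storage; they do not create or delete the PV object.

Operations on the PV object are carried out by provisionClaimOperation and deleteVolumeOperation in ProvisionController, which call the provisioner's Provision and Delete methods to handle the underlying storage.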

func main() {
	provisionController.Run(wait.NeverStop)
}

For this part of the logic, see the nfs-client-provisioner source code analysis.


9.3.2 - nfs-client-provisioner source code analysis

Developing a Dynamic Provisioner requires the helper library.

1. Dynamic Provisioner

1.1. Provisioner Interface

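Developing a Dynamic Provisioner requires implementing the Provisioner interface, which has two methods:

  • Provision: creates the storage resource and returns a PV object.
  • Delete: removes the corresponding storage resource, but does not delete the PV object.

The source of the Provisioner interface is as follows: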

// Provisioner is an interface that creates templates for PersistentVolumes
// and can create the volume as a new resource in the infrastructure provider.
// It can also remove the volume it created from the underlying storage
// provider.
type Provisioner interface {
	// Provision creates a volume i.e. the storage asset and returns a PV object
	// for the volume
	Provision(VolumeOptions) (*v1.PersistentVolume, error)
	// Delete removes the storage asset that was created by Provision backing the
	// given PV. Does not delete the PV object itself.
	//
	// May return IgnoredError to indicate that the call has been ignored and no
	// action taken.
	Delete(*v1.PersistentVolume) error
}

1.2. VolumeOptions

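The Provision method of the Provisioner interface takes a VolumeOptions object. VolumeOptions carries the information needed to create the PV object, for example the PV's reclaim policy, the PV's name, the PVC the PV is created for, and the parameters of the PVC's StorageClass.

The VolumeOptions source is as follows: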

// VolumeOptions contains option information about a volume
// https://github.com/kubernetes/kubernetes/blob/release-1.4/pkg/volume/plugins.go
type VolumeOptions struct {
	// Reclamation policy for a persistent volume
	PersistentVolumeReclaimPolicy v1.PersistentVolumeReclaimPolicy
	// PV.Name of the appropriate PersistentVolume. Used to generate cloud
	// volume name.
	PVName string

	// PV mount options. Not validated - mount of the PVs will simply fail if one is invalid.
	MountOptions []string

	// PVC is reference to the claim that lead to provisioning of a new PV.
	// Provisioners *must* create a PV that would be matched by this PVC,
	// i.e. with required capacity, accessMode, labels matching PVC.Selector and
	// so on.
	PVC *v1.PersistentVolumeClaim
	// Volume provisioning parameters from StorageClass
	Parameters map[string]string

	// Node selected by the scheduler for the volume.
	SelectedNode *v1.Node
	// Topology constraint parameter from StorageClass
	AllowedTopologies []v1.TopologySelectorTerm
}

1.3. ProvisionController

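ProvisionController is a controller that provisions PVs for PVCs. It drives all the logic around calling the Provision and Delete methods of the Provisioner interface.

1.4. Steps to Develop a Provisioner

  1. Write a provisioner that implements the Provisioner interface (the Provision and Delete methods).
  2. Build a ProvisionController from that provisioner.
  3. Call the Run method of the ProvisionController.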
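
The three steps above can be sketched with simplified, stdlib-only stand-ins. Everything in this sketch is an assumption made for illustration: VolumeOptions, PersistentVolume, dirProvisioner, and toyController are hypothetical types, not those of the helper library, and the "API server" is just a map. The point is the division of labor: the provisioner touches only the storage asset, while the controller owns the PV objects.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins (assumptions) for the library's VolumeOptions
// and for v1.PersistentVolume.
type VolumeOptions struct {
	PVName       string
	PVCNamespace string
	PVCName      string
}

type PersistentVolume struct {
	Name string
	Path string
}

// Provisioner has the same two-method shape as the real interface.
type Provisioner interface {
	Provision(VolumeOptions) (*PersistentVolume, error)
	Delete(*PersistentVolume) error
}

// Step 1: a provisioner that manages only the storage asset
// (here an in-memory set of directory paths).
type dirProvisioner struct {
	dirs map[string]bool
}

func (p *dirProvisioner) Provision(o VolumeOptions) (*PersistentVolume, error) {
	if o.PVName == "" {
		return nil, errors.New("PVName must not be empty")
	}
	path := fmt.Sprintf("/export/%s-%s-%s", o.PVCNamespace, o.PVCName, o.PVName)
	p.dirs[path] = true
	return &PersistentVolume{Name: o.PVName, Path: path}, nil
}

func (p *dirProvisioner) Delete(pv *PersistentVolume) error {
	delete(p.dirs, pv.Path)
	return nil
}

// Step 2: a toy controller built from the provisioner. It owns the PV
// objects; the map stands in for the Kubernetes API server.
type toyController struct {
	provisioner Provisioner
	pvs         map[string]*PersistentVolume
}

func newToyController(p Provisioner) *toyController {
	return &toyController{provisioner: p, pvs: map[string]*PersistentVolume{}}
}

// provision asks the provisioner for a volume, then stores the PV object.
func (c *toyController) provision(o VolumeOptions) error {
	pv, err := c.provisioner.Provision(o)
	if err != nil {
		return err
	}
	c.pvs[pv.Name] = pv
	return nil
}

// deleteVolume removes the storage asset first, then the PV object.
func (c *toyController) deleteVolume(name string) error {
	pv, ok := c.pvs[name]
	if !ok {
		return fmt.Errorf("persistentvolume %q not found", name)
	}
	if err := c.provisioner.Delete(pv); err != nil {
		return err
	}
	delete(c.pvs, name)
	return nil
}

func main() {
	// Step 3: run the controller. The real one loops forever; this
	// sketch handles a single claim.
	c := newToyController(&dirProvisioner{dirs: map[string]bool{}})
	opts := VolumeOptions{PVName: "pvc-123", PVCNamespace: "default", PVCName: "data"}
	if err := c.provision(opts); err != nil {
		fmt.Println("provision failed:", err)
		return
	}
	fmt.Println(c.pvs["pvc-123"].Path)
}
```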

2. NFS Client Provisioner

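nfs-client-provisioner is an automatic provisioner that uses NFS as its storage and automatically creates PVs for PVCs. It does not provide NFS storage itself; an NFS service must already exist.

  • PVs are provisioned as directories named ${namespace}-${pvcName}-${pvName} (on the NFS server).
  • When a PV is reclaimed, its directory is renamed to archived-${namespace}-${pvcName}-${pvName} (on the NFS server).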
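
The two naming formats above can be written out as small helpers. This is only a sketch: the function names and the export root /ifs/kubernetes are assumptions chosen for illustration, not values taken from the provisioner.

```go
package main

import (
	"fmt"
	"path"
)

// dirName builds the per-volume directory name
// ${namespace}-${pvcName}-${pvName}.
func dirName(namespace, pvcName, pvName string) string {
	return fmt.Sprintf("%s-%s-%s", namespace, pvcName, pvName)
}

// archivedName builds the name a directory gets when it is archived
// instead of deleted.
func archivedName(dir string) string {
	return "archived-" + dir
}

// exportPath joins the NFS export root with a directory name.
// The package path (not path/filepath) is used because these are
// server-side paths that always use forward slashes.
func exportPath(root, dir string) string {
	return path.Join(root, dir)
}

func main() {
	dir := dirName("default", "test-claim", "pvc-0a1b")
	fmt.Println(exportPath("/ifs/kubernetes", dir))
	fmt.Println(exportPath("/ifs/kubernetes", archivedName(dir)))
}
```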

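The rest of this article walks through the nfs-client-provisioner source to show the whole process of developing a custom provisioner. Most of the nfs-client-provisioner code lives in the file provisioner.go.

nfs-client-provisioner source: https://github.com/kubernetes-incubator/external-storage/tree/master/nfs-client

2.1. The main Function

2.1.1. Reading Environment Variables

The source is as follows: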

func main() {
	flag.Parse()
	flag.Set("logtostderr", "true")

	server := os.Getenv("NFS_SERVER")
	if server == "" {
		glog.Fatal("NFS_SERVER not set")
	}
	path := os.Getenv("NFS_PATH")
	if path == "" {
		glog.Fatal("NFS_PATH not set")
	}
	provisionerName := os.Getenv(provisionerNameKey)
	if provisionerName == "" {
		glog.Fatalf("environment variable %s is not set! Please set it.", provisionerNameKey)
	}
    ...
}   

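The main function first reads the three environment variables NFS_SERVER, NFS_PATH, and PROVISIONER_NAME, so all three must be set when deploying nfs-client-provisioner.

  • NFS_SERVER: the IP address of the NFS server.
  • NFS_PATH: the shared directory exported by the NFS server.
  • PROVISIONER_NAME: the provisioner's name, which must match the provisioner field of the StorageClass object.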

For example, the YAML for the StorageClass object looks like this:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: managed-nfs-storage
provisioner: fuseim.pri/ifs # or choose another name, must match deployment's env PROVISIONER_NAME
parameters:
  archiveOnDelete: "false" # When set to "false" your PVs will not be archived by the provisioner upon deletion of the PVC.

2.1.2. Getting the clientset Object

The source is as follows:

// Create an InClusterConfig and use it to create a client for the controller
// to use to communicate with Kubernetes
config, err := rest.InClusterConfig()
if err != nil {
	glog.Fatalf("Failed to create config: %v", err)
}
clientset, err := kubernetes.NewForConfig(config)
if err != nil {
	glog.Fatalf("Failed to create client: %v", err)
}

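The in-cluster Kubernetes configuration is read to create a clientset object, which is used to call the Kubernetes API, mainly for operations such as creating and deleting PV and PVC objects.

2.1.3. Constructing the nfsProvisioner Object

The source is as follows: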

// The controller needs to know what the server version is because out-of-tree
// provisioners aren't officially supported until 1.5
serverVersion, err := clientset.Discovery().ServerVersion()
if err != nil {
	glog.Fatalf("Error getting server version: %v", err)
}

clientNFSProvisioner := &nfsProvisioner{
	client: clientset,
	server: server,
	path:   path,
}

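The nfsProvisioner object is built from clientset, server, and path. The Kubernetes version is fetched as well, because out-of-tree provisioners are only officially supported from Kubernetes 1.5 onward.

The nfsProvisioner type is defined as follows: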

type nfsProvisioner struct {
	client kubernetes.Interface
	server string
	path   string
}

var _ controller.Provisioner = &nfsProvisioner{}

nfsProvisioner is a custom provisioner that implements the Provisioner interface. Besides the two NFS-related fields server and path, it holds client, which is used to call the Kubernetes API.

var _ controller.Provisioner = &nfsProvisioner{}

The line above is a compile-time check that nfsProvisioner implements the Provisioner interface.
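
The idiom works for any interface. In the minimal sketch below, Greeter and english are hypothetical names used only for illustration; if the Greet method were removed or its signature changed, the var line would stop compiling.

```go
package main

import "fmt"

// Greeter is a hypothetical interface used only for this illustration.
type Greeter interface {
	Greet(name string) string
}

type english struct{}

func (english) Greet(name string) string {
	return fmt.Sprintf("hello, %s", name)
}

// Compile-time check: the build fails on this line if *english stops
// implementing Greeter. The blank identifier discards the value.
var _ Greeter = &english{}

func main() {
	var g Greeter = &english{}
	fmt.Println(g.Greet("nfs"))
}
```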

2.1.4. Building and Running the ProvisionController

The source is as follows:

// Start the provision controller which will dynamically provision efs NFS
// PVs
pc := controller.NewProvisionController(clientset, provisionerName, clientNFSProvisioner, serverVersion.GitVersion)
pc.Run(wait.NeverStop)

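A ProvisionController is built from the nfsProvisioner and its Run method is called. ProvisionController implements the PV and PVC logic, and Run executes as a long-lived process.

2.2. Provision and Delete Methods

2.2.1. The Provision Method

For the source of nfsProvisioner's Provision method, see: https://github.com/kubernetes-incubator/external-storage/blob/master/nfs-client/cmd/nfs-client-provisioner/provisioner.go#L56

The Provision method creates the storage resource and returns a PV object. Its argument is a VolumeOptions, which specifies the attributes of the PV object.

Step 1: Build the directory name from the PVC and PV names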

func (p *nfsProvisioner) Provision(options controller.VolumeOptions) (*v1.PersistentVolume, error) {
	if options.PVC.Spec.Selector != nil {
		return nil, fmt.Errorf("claim Selector is not supported")
	}
	glog.V(4).Infof("nfs provisioner: VolumeOptions %v", options)

	pvcNamespace := options.PVC.Namespace
	pvcName := options.PVC.Name

	pvName := strings.Join([]string{pvcNamespace, pvcName, options.PVName}, "-")

	fullPath := filepath.Join(mountPath, pvName)
	glog.V(4).Infof("creating path %s", fullPath)
	if err := os.MkdirAll(fullPath, 0777); err != nil {
		return nil, errors.New("unable to create directory to provision new pv: " + err.Error())
	}
	os.Chmod(fullPath, 0777)

	path := filepath.Join(p.path, pvName)
    ...
}    

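From the VolumeOptions argument, the method reads the PVC namespace and name, combines them with the PV name into a directory name, creates that directory under the local mount path, and computes the corresponding path on the NFS export.

Step 2: Construct the PV object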

pv := &v1.PersistentVolume{
	ObjectMeta: metav1.ObjectMeta{
		Name: options.PVName,
	},
	Spec: v1.PersistentVolumeSpec{
		PersistentVolumeReclaimPolicy: options.PersistentVolumeReclaimPolicy,
		AccessModes:                   options.PVC.Spec.AccessModes,
		MountOptions:                  options.MountOptions,
		Capacity: v1.ResourceList{
			v1.ResourceName(v1.ResourceStorage): options.PVC.Spec.Resources.Requests[v1.ResourceName(v1.ResourceStorage)],
		},
		PersistentVolumeSource: v1.PersistentVolumeSource{
			NFS: &v1.NFSVolumeSource{
				Server:   p.server,
				Path:     path,
				ReadOnly: false,
			},
		},
	},
}
return pv, nil

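In summary, the Provision method only creates the backing directory and builds a PV object from the VolumeOptions argument; it does not create or delete the PV object in Kubernetes.

Different types of provisioner generally differ in the PersistentVolumeSource type and its parameters. For nfs-provisioner, for example, the PersistentVolumeSource is NFS, which requires NFS-specific parameters such as Server and Path.

2.2.2. The Delete Method

For the source of nfsProvisioner's Delete method, see: https://github.com/kubernetes-incubator/external-storage/blob/master/nfs-client/cmd/nfs-client-provisioner/provisioner.go#L99

Step 1: Get pvName, path, and related parameters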

func (p *nfsProvisioner) Delete(volume *v1.PersistentVolume) error {
	path := volume.Spec.PersistentVolumeSource.NFS.Path
	pvName := filepath.Base(path)
	oldPath := filepath.Join(mountPath, pvName)
	if _, err := os.Stat(oldPath); os.IsNotExist(err) {
		glog.Warningf("path %s does not exist, deletion skipped", oldPath)
		return nil
	}
    ...
}    

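oldPath is derived from path and pvName; it is the directory on the NFS server that held the pod's persistent data. If that directory no longer exists, deletion is skipped.

Step 2: Read the archiveOnDelete parameter and delete the data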

// Get the storage class for this volume.
storageClass, err := p.getClassForVolume(volume)
if err != nil {
	return err
}
// Determine if the "archiveOnDelete" parameter exists.
// If it exists and has a falsey value, delete the directory.
// Otherwise, archive it.
archiveOnDelete, exists := storageClass.Parameters["archiveOnDelete"]
if exists {
	archiveBool, err := strconv.ParseBool(archiveOnDelete)
	if err != nil {
		return err
	}
	if !archiveBool {
		return os.RemoveAll(oldPath)
	}
}

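If the storageClass object specifies the archiveOnDelete parameter with the value false, all data under oldPath, that is the pod's persisted data, is deleted.

archiveOnDelete literally means "archive on delete": false means do not archive and delete the data, while true means archive it by renaming the path.

Step 3: Rename the old data path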

archivePath := filepath.Join(mountPath, "archived-"+pvName)
glog.V(4).Infof("archiving path %s to %s", oldPath, archivePath)
return os.Rename(oldPath, archivePath)

If the storageClass object does not specify archiveOnDelete, or its value is true, the data is archived on delete: oldPath is renamed by prefixing the directory name with archived-.
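
The decision made in steps 2 and 3 can be isolated into one function. This is a sketch under assumptions: deleteAction is a hypothetical helper that returns a label instead of touching the file system, so the three cases (parameter absent, true, false) are easy to see.

```go
package main

import (
	"fmt"
	"strconv"
)

// deleteAction reports what Delete would do with the data directory for
// the given StorageClass parameters: "remove" or "archive".
func deleteAction(parameters map[string]string) (string, error) {
	raw, exists := parameters["archiveOnDelete"]
	if !exists {
		// Parameter absent: archive by default.
		return "archive", nil
	}
	archive, err := strconv.ParseBool(raw)
	if err != nil {
		return "", fmt.Errorf("invalid archiveOnDelete value %q: %v", raw, err)
	}
	if archive {
		return "archive", nil
	}
	return "remove", nil
}

func main() {
	for _, params := range []map[string]string{
		{},
		{"archiveOnDelete": "true"},
		{"archiveOnDelete": "false"},
	} {
		action, err := deleteAction(params)
		fmt.Println(params, action, err)
	}
}
```

Because strconv.ParseBool is used, a value that is not a recognized boolean makes Delete return an error rather than silently archiving or removing the data.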

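3. ProvisionController

3.1. The ProvisionController Struct

For the source, see: https://github.com/kubernetes-incubator/external-storage/blob/master/lib/controller/controller.go#L82

ProvisionController is a controller that provisions PVs for PVCs. It drives all the logic around calling the Provision and Delete methods of the Provisioner interface.

3.1.1. Input Parameters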

// ProvisionController is a controller that provisions PersistentVolumes for
// PersistentVolumeClaims.
type ProvisionController struct {
	client kubernetes.Interface

	// The name of the provisioner for which this controller dynamically
	// provisions volumes. The value of annDynamicallyProvisioned and
	// annStorageProvisioner to set & watch for, respectively
	provisionerName string

	// The provisioner the controller will use to provision and delete volumes.
	// Presumably this implementer of Provisioner carries its own
	// volume-specific options and such that it needs in order to provision
	// volumes.
	provisioner Provisioner

	// Kubernetes cluster server version:
	// * 1.4: storage classes introduced as beta. Technically out-of-tree dynamic
	// provisioning is not officially supported, though it works
	// * 1.5: storage classes stay in beta. Out-of-tree dynamic provisioning is
	// officially supported
	// * 1.6: storage classes enter GA
	kubeVersion *utilversion.Version
    ...
}   

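The fields client, provisionerName, provisioner, and kubeVersion are passed in as arguments to NewProvisionController.

  • client: the clientset used to call the Kubernetes API.
  • provisionerName: the provisioner's name, which must match the provisioner field of the StorageClass object.
  • provisioner: the concrete provisioner implementation, nfsProvisioner in this article.
  • kubeVersion: the Kubernetes version information.

3.1.2. Controller and Informer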

type ProvisionController struct {
	...
	claimInformer    cache.SharedInformer
	claims           cache.Store
	claimController  cache.Controller
	volumeInformer   cache.SharedInformer
	volumes          cache.Store
	volumeController cache.Controller
	classInformer    cache.SharedInformer
	classes          cache.Store
	classController  cache.Controller
    ...
}    

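The ProvisionController struct holds a Controller, an Informer, and a Store for each of the three object types PV, PVC, and StorageClass, which are used to operate on those objects.

  • Controller: the generic control framework.
  • Informer: the event notifier.
  • Store: the generic object storage interface.

3.1.3. workqueue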

type ProvisionController struct {
    ...
	claimQueue  workqueue.RateLimitingInterface
	volumeQueue workqueue.RateLimitingInterface
    ...
}    

claimQueue and volumeQueue are the work queues for PVCs and PVs respectively.

3.1.4. Other Fields

// Identity of this controller, generated at creation time and not persisted
// across restarts. Useful only for debugging, for seeing the source of
// events. controller.provisioner may have its own, different notion of
// identity which may/may not persist across restarts
id            string
component     string
eventRecorder record.EventRecorder

resyncPeriod time.Duration

exponentialBackOffOnError bool
threadiness               int

createProvisionedPVRetryCount int
createProvisionedPVInterval   time.Duration

failedProvisionThreshold, failedDeleteThreshold int

// The port for metrics server to serve on.
metricsPort int32
// The IP address for metrics server to serve on.
metricsAddress string
// The path of metrics endpoint path.
metricsPath string

// Parameters of leaderelection.LeaderElectionConfig.
leaseDuration, renewDeadline, retryPeriod time.Duration

hasRun     bool
hasRunLock *sync.Mutex

3.2. The NewProvisionController Method

Source: https://github.com/kubernetes-incubator/external-storage/blob/master/lib/controller/controller.go#L418

The NewProvisionController method constructs a ProvisionController.

3.2.1. Initializing Default Values

// NewProvisionController creates a new provision controller using
// the given configuration parameters and with private (non-shared) informers.
func NewProvisionController(
	client kubernetes.Interface,
	provisionerName string,
	provisioner Provisioner,
	kubeVersion string,
	options ...func(*ProvisionController) error,
) *ProvisionController {
	...
	controller := &ProvisionController{
		client:                        client,
		provisionerName:               provisionerName,
		provisioner:                   provisioner,
		kubeVersion:                   utilversion.MustParseSemantic(kubeVersion),
		id:                            id,
		component:                     component,
		eventRecorder:                 eventRecorder,
		resyncPeriod:                  DefaultResyncPeriod,
		exponentialBackOffOnError:     DefaultExponentialBackOffOnError,
		threadiness:                   DefaultThreadiness,
		createProvisionedPVRetryCount: DefaultCreateProvisionedPVRetryCount,
		createProvisionedPVInterval:   DefaultCreateProvisionedPVInterval,
		failedProvisionThreshold:      DefaultFailedProvisionThreshold,
		failedDeleteThreshold:         DefaultFailedDeleteThreshold,
		leaseDuration:                 DefaultLeaseDuration,
		renewDeadline:                 DefaultRenewDeadline,
		retryPeriod:                   DefaultRetryPeriod,
		metricsPort:                   DefaultMetricsPort,
		metricsAddress:                DefaultMetricsAddress,
		metricsPath:                   DefaultMetricsPath,
		hasRun:                        false,
		hasRunLock:                    &sync.Mutex{},
	}
    ...
}    

3.2.2. Initializing the Work Queues

ratelimiter := workqueue.NewMaxOfRateLimiter(
	workqueue.NewItemExponentialFailureRateLimiter(15*time.Second, 1000*time.Second),
	&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
)
if !controller.exponentialBackOffOnError {
	ratelimiter = workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(15*time.Second, 15*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}
controller.claimQueue = workqueue.NewNamedRateLimitingQueue(ratelimiter, "claims")
controller.volumeQueue = workqueue.NewNamedRateLimitingQueue(ratelimiter, "volumes")

3.2.3. ListWatch

// PVC
claimSource := &cache.ListWatch{
	ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
		return client.CoreV1().PersistentVolumeClaims(v1.NamespaceAll).List(options)
	},
	WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
		return client.CoreV1().PersistentVolumeClaims(v1.NamespaceAll).Watch(options)
	},
}
// PV
volumeSource := &cache.ListWatch{
	ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
		return client.CoreV1().PersistentVolumes().List(options)
	},
	WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
		return client.CoreV1().PersistentVolumes().Watch(options)
	},
}
// StorageClass
classSource = &cache.ListWatch{
	ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
		return client.StorageV1().StorageClasses().List(options)
	},
	WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
		return client.StorageV1().StorageClasses().Watch(options)
	},
}

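The list-watch mechanism is the core mechanism Kubernetes uses to observe object changes. ListWatch contains two functions, ListFunc and WatchFunc, neither of which may be nil. The code above builds a ListWatch struct for each of PVC, PV, and StorageClass. The mechanism is implemented in the cache package of client-go; see https://godoc.org/k8s.io/client-go/tools/cache.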
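
The list-then-watch pattern can be modeled without client-go. In the sketch below every type is a simplified, hypothetical stand-in (Event for watch.Event, an int for the resource version, a map for the Store): the consumer lists once to seed a local store, then applies the events that follow the listed version.

```go
package main

import "fmt"

// Event is a simplified stand-in (assumption) for watch.Event.
type Event struct {
	Type string // "ADDED", "MODIFIED" or "DELETED"
	Name string
}

// ListWatch has the same two-function shape as cache.ListWatch, with
// simplified signatures.
type ListWatch struct {
	// ListFunc returns the current object names and the resource
	// version of that snapshot.
	ListFunc func() (names []string, resourceVersion int)
	// WatchFunc streams the events that happened after resourceVersion.
	WatchFunc func(resourceVersion int) <-chan Event
}

// syncStore lists once to fill the local store, then applies watch
// events until the channel is closed.
func syncStore(lw ListWatch) map[string]bool {
	store := map[string]bool{}
	names, rv := lw.ListFunc()
	for _, n := range names {
		store[n] = true
	}
	for ev := range lw.WatchFunc(rv) {
		switch ev.Type {
		case "ADDED", "MODIFIED":
			store[ev.Name] = true
		case "DELETED":
			delete(store, ev.Name)
		}
	}
	return store
}

// fakeSource builds a ListWatch over a fixed snapshot and a fixed event
// log (illustration only).
func fakeSource(snapshot []string, events []Event) ListWatch {
	return ListWatch{
		ListFunc: func() ([]string, int) { return snapshot, 0 },
		WatchFunc: func(rv int) <-chan Event {
			ch := make(chan Event, len(events))
			for _, ev := range events[rv:] {
				ch <- ev
			}
			close(ch)
			return ch
		},
	}
}

func main() {
	lw := fakeSource(
		[]string{"pvc-a", "pvc-b"},
		[]Event{{"ADDED", "pvc-c"}, {"DELETED", "pvc-a"}},
	)
	fmt.Println(syncStore(lw))
}
```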

More of the ListWatch code is shown below.

See: https://github.com/kubernetes-incubator/external-storage/blob/89b0aaf6413b249b37834b124fc314ef7b8ee949/vendor/k8s.io/client-go/tools/cache/listwatch.go#L34

// ListerWatcher is any object that knows how to perform an initial list and start a watch on a resource.
type ListerWatcher interface {
	// List should return a list type object; the Items field will be extracted, and the
	// ResourceVersion field will be used to start the watch in the right place.
	List(options metav1.ListOptions) (runtime.Object, error)
	// Watch should begin a watch at the specified version.
	Watch(options metav1.ListOptions) (watch.Interface, error)
}

// ListFunc knows how to list resources
type ListFunc func(options metav1.ListOptions) (runtime.Object, error)

// WatchFunc knows how to watch resources
type WatchFunc func(options metav1.ListOptions) (watch.Interface, error)

// ListWatch knows how to list and watch a set of apiserver resources.  It satisfies the ListerWatcher interface.
// It is a convenience function for users of NewReflector, etc.
// ListFunc and WatchFunc must not be nil
type ListWatch struct {
	ListFunc  ListFunc
	WatchFunc WatchFunc
	// DisableChunking requests no chunking for this list watcher.
	DisableChunking bool
}

3.2.4. ResourceEventHandlerFuncs

// PVC
claimHandler := cache.ResourceEventHandlerFuncs{
	AddFunc:    func(obj interface{}) { controller.enqueueWork(controller.claimQueue, obj) },
	UpdateFunc: func(oldObj, newObj interface{}) { controller.enqueueWork(controller.claimQueue, newObj) },
	DeleteFunc: func(obj interface{}) { controller.forgetWork(controller.claimQueue, obj) },
}
// PV
volumeHandler := cache.ResourceEventHandlerFuncs{
	AddFunc:    func(obj interface{}) { controller.enqueueWork(controller.volumeQueue, obj) },
	UpdateFunc: func(oldObj, newObj interface{}) { controller.enqueueWork(controller.volumeQueue, newObj) },
	DeleteFunc: func(obj interface{}) { controller.forgetWork(controller.volumeQueue, obj) },
}
// StorageClass
classHandler := cache.ResourceEventHandlerFuncs{
	// We don't need an actual event handler for StorageClasses,
	// but we must pass a non-nil one to cache.NewInformer()
	AddFunc:    nil,
	UpdateFunc: nil,
	DeleteFunc: nil,
}

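ResourceEventHandlerFuncs holds the resource event callbacks that are notified when Kubernetes objects are added, updated, or deleted. The type implements the ResourceEventHandler interface. The logic lives in the cache package of client-go.

More of the ResourceEventHandlerFuncs code: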

// ResourceEventHandler can handle notifications for events that happen to a
// resource. The events are informational only, so you can't return an
// error.
//  * OnAdd is called when an object is added.
//  * OnUpdate is called when an object is modified. Note that oldObj is the
//      last known state of the object-- it is possible that several changes
//      were combined together, so you can't use this to see every single
//      change. OnUpdate is also called when a re-list happens, and it will
//      get called even if nothing changed. This is useful for periodically
//      evaluating or syncing something.
//  * OnDelete will get the final state of the item if it is known, otherwise
//      it will get an object of type DeletedFinalStateUnknown. This can
//      happen if the watch is closed and misses the delete event and we don't
//      notice the deletion until the subsequent re-list.
type ResourceEventHandler interface {
	OnAdd(obj interface{})
	OnUpdate(oldObj, newObj interface{})
	OnDelete(obj interface{})
}

// ResourceEventHandlerFuncs is an adaptor to let you easily specify as many or
// as few of the notification functions as you want while still implementing
// ResourceEventHandler.
type ResourceEventHandlerFuncs struct {
	AddFunc    func(obj interface{})
	UpdateFunc func(oldObj, newObj interface{})
	DeleteFunc func(obj interface{})
}
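
The adaptor pattern described in the comment above can be reproduced in a few lines. This sketch uses hypothetical names (EventHandler, HandlerFuncs, queueingHandler) and is not the client-go source; it shows why a handler with nil callbacks, like classHandler, is still a valid handler: each method checks for nil before calling.

```go
package main

import "fmt"

// EventHandler has the same shape as ResourceEventHandler.
type EventHandler interface {
	OnAdd(obj interface{})
	OnUpdate(oldObj, newObj interface{})
	OnDelete(obj interface{})
}

// HandlerFuncs adapts optional callbacks to EventHandler; callbacks
// left nil are simply skipped.
type HandlerFuncs struct {
	AddFunc    func(obj interface{})
	UpdateFunc func(oldObj, newObj interface{})
	DeleteFunc func(obj interface{})
}

func (h HandlerFuncs) OnAdd(obj interface{}) {
	if h.AddFunc != nil {
		h.AddFunc(obj)
	}
}

func (h HandlerFuncs) OnUpdate(oldObj, newObj interface{}) {
	if h.UpdateFunc != nil {
		h.UpdateFunc(oldObj, newObj)
	}
}

func (h HandlerFuncs) OnDelete(obj interface{}) {
	if h.DeleteFunc != nil {
		h.DeleteFunc(obj)
	}
}

// Compile-time check that the adaptor satisfies the interface.
var _ EventHandler = HandlerFuncs{}

// queueingHandler records a key for every add and update, loosely like
// claimHandler enqueues work. DeleteFunc is left nil on purpose.
func queueingHandler(queue *[]string) HandlerFuncs {
	return HandlerFuncs{
		AddFunc:    func(obj interface{}) { *queue = append(*queue, fmt.Sprint(obj)) },
		UpdateFunc: func(_, newObj interface{}) { *queue = append(*queue, fmt.Sprint(newObj)) },
	}
}

func main() {
	var queue []string
	var h EventHandler = queueingHandler(&queue)
	h.OnAdd("default/claim-a")
	h.OnUpdate("default/claim-a", "default/claim-a")
	h.OnDelete("default/claim-a") // no-op: DeleteFunc is nil
	fmt.Println(queue)
}
```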

3.2.5. Constructing the Store and Controller

(1) PVC

if controller.claimInformer != nil {
	controller.claimInformer.AddEventHandlerWithResyncPeriod(claimHandler, controller.resyncPeriod)
	controller.claims, controller.claimController =
		controller.claimInformer.GetStore(),
		controller.claimInformer.GetController()
} else {
	controller.claims, controller.claimController =
		cache.NewInformer(
			claimSource,
			&v1.PersistentVolumeClaim{},
			controller.resyncPeriod,
			claimHandler,
		)
}

(2) PV

if controller.volumeInformer != nil {
	controller.volumeInformer.AddEventHandlerWithResyncPeriod(volumeHandler, controller.resyncPeriod)
	controller.volumes, controller.volumeController =
		controller.volumeInformer.GetStore(),
		controller.volumeInformer.GetController()
} else {
	controller.volumes, controller.volumeController =
		cache.NewInformer(
			volumeSource,
			&v1.PersistentVolume{},
			controller.resyncPeriod,
			volumeHandler,
		)
}

(3) StorageClass

if controller.classInformer != nil {
	// no resource event handler needed for StorageClasses
	controller.classes, controller.classController =
		controller.classInformer.GetStore(),
		controller.classInformer.GetController()
} else {
	controller.classes, controller.classController = cache.NewInformer(
		classSource,
		versionedClassType,
		controller.resyncPeriod,
		classHandler,
	)
}

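When no shared informer has been supplied, these are constructed with cache.NewInformer, whose arguments include the ListWatch struct and the ResourceEventHandlerFuncs, and which returns a Store and a Controller.

With all of the parts above constructed, the function finally returns a ProvisionController object.

3.3. The ProvisionController.Run Method

The Run method of ProvisionController runs as a long-lived process and starts the other controllers internally.

3.3.1. Prometheus Metrics Collection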

// Run starts all of this controller's control loops
func (ctrl *ProvisionController) Run(stopCh <-chan struct{}) {

	run := func(stopCh <-chan struct{}) {
		...
		if ctrl.metricsPort > 0 {
			prometheus.MustRegister([]prometheus.Collector{
				metrics.PersistentVolumeClaimProvisionTotal,
				metrics.PersistentVolumeClaimProvisionFailedTotal,
				metrics.PersistentVolumeClaimProvisionDurationSeconds,
				metrics.PersistentVolumeDeleteTotal,
				metrics.PersistentVolumeDeleteFailedTotal,
				metrics.PersistentVolumeDeleteDurationSeconds,
			}...)
			http.Handle(ctrl.metricsPath, promhttp.Handler())
			address := net.JoinHostPort(ctrl.metricsAddress, strconv.FormatInt(int64(ctrl.metricsPort), 10))
			glog.Infof("Starting metrics server at %s\n", address)
			go wait.Forever(func() {
				err := http.ListenAndServe(address, nil)
				if err != nil {
					glog.Errorf("Failed to listen on %s: %v", address, err)
				}
			}, 5*time.Second)
		}
        ...
}        

3.3.2. Controller.Run

// If a SharedInformer has been passed in, this controller should not
// call Run again
if ctrl.claimInformer == nil {
	go ctrl.claimController.Run(stopCh)
}
if ctrl.volumeInformer == nil {
	go ctrl.volumeController.Run(stopCh)
}
if ctrl.classInformer == nil {
	go ctrl.classController.Run(stopCh)
}

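This starts the informer controllers, unless a SharedInformer was passed in, in which case its owner is responsible for running it.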

3.3.3. Worker

for i := 0; i < ctrl.threadiness; i++ {
	go wait.Until(ctrl.runClaimWorker, time.Second, stopCh)
	go wait.Until(ctrl.runVolumeWorker, time.Second, stopCh)
}

runClaimWorker and runVolumeWorker are the workers for PVCs and PVs respectively; the actual work is done in processNextClaimWorkItem and processNextVolumeWorkItem.

The execution flow is as follows:

Call chain for PVCs:

runClaimWorker→processNextClaimWorkItem→syncClaimHandler→syncClaim→provisionClaimOperation

Call chain for PVs:

runVolumeWorker→processNextVolumeWorkItem→syncVolumeHandler→syncVolume→deleteVolumeOperation

The functions executed at the end of the two chains are therefore provisionClaimOperation and deleteVolumeOperation.
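
The worker pattern behind these call chains can be sketched with channels and goroutines. This is an illustration under assumptions: a buffered channel stands in for the rate-limited work queue, runWorkers and demoSync are hypothetical names, and failed keys are collected instead of being requeued as the real controller would do.

```go
package main

import (
	"fmt"
	"sync"
)

// worker drains keys from the queue and passes each one to syncFn, the
// way runClaimWorker keeps processing items until the queue shuts down.
func worker(queue <-chan string, syncFn func(key string) error, failed chan<- string) {
	for key := range queue {
		if err := syncFn(key); err != nil {
			// The real controller would requeue the key with rate limiting.
			failed <- key
		}
	}
}

// runWorkers starts `threadiness` workers, waits until the queue is
// drained, and returns the keys whose sync failed.
func runWorkers(threadiness int, keys []string, syncFn func(string) error) []string {
	queue := make(chan string, len(keys))
	failed := make(chan string, len(keys))
	for _, k := range keys {
		queue <- k
	}
	close(queue)

	var wg sync.WaitGroup
	for i := 0; i < threadiness; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			worker(queue, syncFn, failed)
		}()
	}
	wg.Wait()
	close(failed)

	var out []string
	for k := range failed {
		out = append(out, k)
	}
	return out
}

// demoSync is a stand-in for syncClaim: it fails for one known key.
func demoSync(key string) error {
	if key == "default/bad" {
		return fmt.Errorf("cannot provision %s", key)
	}
	return nil
}

func main() {
	failed := runWorkers(4, []string{"default/a", "default/bad", "default/b"}, demoSync)
	fmt.Println("failed keys:", failed)
}
```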

3.4. Operation

3.4.1. provisionClaimOperation

Step 1: provisionClaimOperation takes a PVC, derives the PV name from it, and checks whether that PV already exists. If it does, the remaining steps are skipped.

// provisionClaimOperation attempts to provision a volume for the given claim.
// Returns error, which indicates whether provisioning should be retried
// (requeue the claim) or not
func (ctrl *ProvisionController) provisionClaimOperation(claim *v1.PersistentVolumeClaim) error {
	// Most code here is identical to that found in controller.go of kube's PV controller...
	claimClass := helper.GetPersistentVolumeClaimClass(claim)
	operation := fmt.Sprintf("provision %q class %q", claimToClaimKey(claim), claimClass)
	glog.Infof(logOperation(operation, "started"))

	//  A previous doProvisionClaim may just have finished while we were waiting for
	//  the locks. Check that PV (with deterministic name) hasn't been provisioned
	//  yet.
	pvName := ctrl.getProvisionedVolumeNameForClaim(claim)
	volume, err := ctrl.client.CoreV1().PersistentVolumes().Get(pvName, metav1.GetOptions{})
	if err == nil && volume != nil {
		// Volume has been already provisioned, nothing to do.
		glog.Infof(logOperation(operation, "persistentvolume %q already exists, skipping", pvName))
		return nil
	}
    ...
}    

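Step 2: Read the Provisioner and ReclaimPolicy fields of the StorageClass object. If provisionerName does not match the StorageClass's provisioner field, an error is logged and the function returns without provisioning.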

provisioner, parameters, err := ctrl.getStorageClassFields(claimClass)
if err != nil {
	glog.Errorf(logOperation(operation, "error getting claim's StorageClass's fields: %v", err))
	return nil
}
if provisioner != ctrl.provisionerName {
	// class.Provisioner has either changed since shouldProvision() or
	// annDynamicallyProvisioned contains different provisioner than
	// class.Provisioner.
	glog.Errorf(logOperation(operation, "unknown provisioner %q requested in claim's StorageClass", provisioner))
	return nil
}
// Check if this provisioner can provision this claim.
if err = ctrl.canProvision(claim); err != nil {
	ctrl.eventRecorder.Event(claim, v1.EventTypeWarning, "ProvisioningFailed", err.Error())
	glog.Errorf(logOperation(operation, "failed to provision volume: %v", err))
	return nil
}

reclaimPolicy := v1.PersistentVolumeReclaimDelete
if ctrl.kubeVersion.AtLeast(utilversion.MustParseSemantic("v1.8.0")) {
	reclaimPolicy, err = ctrl.fetchReclaimPolicy(claimClass)
	if err != nil {
		return err
	}
}

Step 3: Call the concrete provisioner's Provision method to build the PV object. In this article the provisioner is nfs-provisioner.

options := VolumeOptions{
	PersistentVolumeReclaimPolicy: reclaimPolicy,
	PVName:            pvName,
	PVC:               claim,
	MountOptions:      mountOptions,
	Parameters:        parameters,
	SelectedNode:      selectedNode,
	AllowedTopologies: allowedTopologies,
}

ctrl.eventRecorder.Event(claim, v1.EventTypeNormal, "Provisioning", fmt.Sprintf("External provisioner is provisioning volume for claim %q", claimToClaimKey(claim)))

volume, err = ctrl.provisioner.Provision(options)
if err != nil {
	if ierr, ok := err.(*IgnoredError); ok {
		// Provision ignored, do nothing and hope another provisioner will provision it.
		glog.Infof(logOperation(operation, "volume provision ignored: %v", ierr))
		return nil
	}
	err = fmt.Errorf("failed to provision volume with StorageClass %q: %v", claimClass, err)
	ctrl.eventRecorder.Event(claim, v1.EventTypeWarning, "ProvisioningFailed", err.Error())
	return err
}

Step 4: Create the PV object in Kubernetes.

// Try to create the PV object several times
for i := 0; i < ctrl.createProvisionedPVRetryCount; i++ {
	glog.Infof(logOperation(operation, "trying to save persistentvvolume %q", volume.Name))
	if _, err = ctrl.client.CoreV1().PersistentVolumes().Create(volume); err == nil || apierrs.IsAlreadyExists(err) {
		// Save succeeded.
		if err != nil {
			glog.Infof(logOperation(operation, "persistentvolume %q already exists, reusing", volume.Name))
			err = nil
		} else {
			glog.Infof(logOperation(operation, "persistentvolume %q saved", volume.Name))
		}
		break
	}
	// Save failed, try again after a while.
	glog.Infof(logOperation(operation, "failed to save persistentvolume %q: %v", volume.Name, err))
	time.Sleep(ctrl.createProvisionedPVInterval)
}

Step 5: If creating the PV failed, clean up the storage resource.

if err != nil {
	// Save failed. Now we have a storage asset outside of Kubernetes,
	// but we don't have appropriate PV object for it.
	// Emit some event here and try to delete the storage asset several
	// times.
	...
	for i := 0; i < ctrl.createProvisionedPVRetryCount; i++ {
		if err = ctrl.provisioner.Delete(volume); err == nil {
			// Delete succeeded
			glog.Infof(logOperation(operation, "cleaning volume %q succeeded", volume.Name))
			break
		}
		// Delete failed, try again after a while.
		glog.Infof(logOperation(operation, "failed to clean volume %q: %v", volume.Name, err))
		time.Sleep(ctrl.createProvisionedPVInterval)
	}
	if err != nil {
		// Delete failed several times. There is an orphaned volume and there
		// is nothing we can do about it.
		strerr := fmt.Sprintf("Error cleaning provisioned volume for claim %s: %v. Please delete manually.", claimToClaimKey(claim), err)
		glog.Error(logOperation(operation, strerr))
		ctrl.eventRecorder.Event(claim, v1.EventTypeWarning, "ProvisioningCleanupFailed", strerr)
	}
}

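If creation succeeds, a success message is logged and nil is returned.

3.4.2. deleteVolumeOperation

Step 1: deleteVolumeOperation takes a PV, fetches the current PV object, and checks whether it still needs to be deleted.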

// deleteVolumeOperation attempts to delete the volume backing the given
// volume. Returns error, which indicates whether deletion should be retried
// (requeue the volume) or not
func (ctrl *ProvisionController) deleteVolumeOperation(volume *v1.PersistentVolume) error {
	...
	// This method may have been waiting for a volume lock for some time.
	// Our check does not have to be as sophisticated as PV controller's, we can
	// trust that the PV controller has set the PV to Released/Failed and it's
	// ours to delete
	newVolume, err := ctrl.client.CoreV1().PersistentVolumes().Get(volume.Name, metav1.GetOptions{})
	if err != nil {
		return nil
	}
	if !ctrl.shouldDelete(newVolume) {
		glog.Infof(logOperation(operation, "persistentvolume no longer needs deletion, skipping"))
		return nil
	}
    ...
}    

Step 2: Call the concrete provisioner's Delete method. For nfs-provisioner, for example, this is nfs-provisioner's Delete method.

err = ctrl.provisioner.Delete(volume)
if err != nil {
	if ierr, ok := err.(*IgnoredError); ok {
		// Delete ignored, do nothing and hope another provisioner will delete it.
		glog.Infof(logOperation(operation, "volume deletion ignored: %v", ierr))
		return nil
	}
	// Delete failed, emit an event.
	glog.Errorf(logOperation(operation, "volume deletion failed: %v", err))
	ctrl.eventRecorder.Event(volume, v1.EventTypeWarning, "VolumeFailedDelete", err.Error())
	return err
}

Step 3: Delete the PV object in Kubernetes.

// Delete the volume
if err = ctrl.client.CoreV1().PersistentVolumes().Delete(volume.Name, nil); err != nil {
	// Oops, could not delete the volume and therefore the controller will
	// try to delete the volume again on next update.
	glog.Infof(logOperation(operation, "failed to delete persistentvolume: %v", err))
	return err
}

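4. Summary

  1. The Provisioner interface has two methods, Provision and Delete, which a custom provisioner must implement. These methods only handle what is specific to the storage type; they do not add or remove PV or PVC objects.
  2. The Provision method mainly builds the PV object. Different types of provisioner generally differ in the PersistentVolumeSource type and its parameters. For nfs-provisioner, for example, the PersistentVolumeSource is NFS, which requires NFS-specific parameters such as Server and Path.
  3. The Delete method archives (backs up) or deletes the data, depending on the storage type.
  4. The StorageClass object must be created separately; it selects the provisioner that executes the logic.
  5. provisionClaimOperation and deleteVolumeOperation create and delete the PV objects in Kubernetes, and call the concrete provisioner's Provision and Delete methods to handle the stored data.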

10 - Troubleshooting

10.1 - Node Issues

10.1.1 - keycreate permission denied

Problem Description

write /proc/self/attr/keycreate: permission denied

Full error:

kuberuntime_manager.go:758] createPodSandbox for pod "ecc-hostpath-provisioner-8jbhf_kube-system(b8050fd3-4ffe-11eb-a82e-c6090b53405b)" failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "ecc-hostpath-provisioner-8jbhf": Error response from daemon: OCI runtime create failed: container_linux.go:349: starting container process caused "process_linux.go:449: container init caused \"write /proc/self/attr/keycreate: permission denied\"": unknown

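Solution

The cause is that SELinux is not set to disabled.

# Set SELinux to disabled
setenforce 0 # temporary: switches to permissive mode until the next reboot
# Permanent, but requires a reboot; combined with the command above, an immediate reboot is not needed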
sed -i "s/SELINUX=enforcing/SELINUX=disabled/g" /etc/selinux/config

# Check the SELinux status
$ /usr/sbin/sestatus -v 
SELinux status:                 disabled

$ getenforce
Disabled

10.1.2 - Cgroup Does Not Support the pids Resource

Problem Description

The machine runs an old kernel and kubelet fails to start with the following error:

Failed to start ContainerManager failed to initialize top level QOS containers: failed to update top level Burstable QOS cgroup : failed to set supported cgroup subsystems for cgroup [kubepods burstable]: Failed to find subsystem mount for required subsystem: pids

Root Cause

On older kernels, cgroup does not support the pids subsystem:

cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	5	6	1
cpu	2	76	1
cpuacct	2	76	1
memory	4	76	1
devices	10	76	1
freezer	7	6	1
net_cls	3	6	1
blkio	8	76	1
perf_event	9	6	1
hugetlb	6	6	1

cgroup on a healthy machine:

root@host:~# cat /proc/cgroups
#subsys_name	hierarchy	num_cgroups	enabled
cpuset	5	17	1
cpu	7	80	1
cpuacct	7	80	1
memory	12	80	1
devices	10	80	1
freezer	2	17	1
net_cls	4	17	1
blkio	8	80	1
perf_event	6	17	1
hugetlb	11	17	1
pids	3	80	1    # the pids resource is supported here
oom	9	1	1
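
The comparison of the two outputs can be automated. The sketch below is a hypothetical helper, not part of kubelet: it parses text in the /proc/cgroups format shown above and reports whether a subsystem is listed with enabled set to 1.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// hasCgroupSubsystem reports whether content (in /proc/cgroups format)
// lists the named subsystem with its "enabled" column set to 1.
func hasCgroupSubsystem(content, name string) bool {
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "#") {
			continue // header line
		}
		// Columns: subsys_name hierarchy num_cgroups enabled
		fields := strings.Fields(line)
		if len(fields) >= 4 && fields[0] == name && fields[3] == "1" {
			return true
		}
	}
	return false
}

func main() {
	data, err := os.ReadFile("/proc/cgroups")
	if err != nil {
		fmt.Println("cannot read /proc/cgroups:", err)
		return
	}
	fmt.Println("pids supported:", hasCgroupSubsystem(string(data), "pids"))
}
```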

Solution

1. Upgrade the kernel so that cgroup supports the pids resource.

Or:

2. Add SupportPodPidsLimit=false,SupportNodePidsLimit=false to the kubelet's --feature-gates startup flag.

vi /etc/systemd/system/kubelet.service

# Add to the kubelet startup flags
--feature-gates=... ,SupportPodPidsLimit=false,SupportNodePidsLimit=false \

systemctl daemon-reload && systemctl restart kubelet.service

10.1.3 - Cgroup subsystem cannot be mounted

Problem description

Kernel version: 5.4.56-200.el7.x86_64

docker errors:

May 13 16:54:26 8b26d7a8 dockerd[44352]: time="2021-05-13T16:54:26.565235530+08:00" level=warning msg="failed to load plugin io.containerd.snapshotter.v1.devmapper" error="devmapper not configured"
May 13 16:54:26 8b26d7a8 dockerd[44352]: time="2021-05-13T16:54:26.565525512+08:00" level=warning msg="could not use snapshotter devmapper in metadata plugin" error="devmapper not configured"
May 13 16:54:26 8b26d7a8 dockerd[44352]: time="2021-05-13T16:54:26.574734345+08:00" level=warning msg="Your kernel does not support CPU realtime scheduler"
May 13 16:54:26 8b26d7a8 dockerd[44352]: time="2021-05-13T16:54:26.574792864+08:00" level=warning msg="Your kernel does not support cgroup blkio weight"
May 13 16:54:26 8b26d7a8 dockerd[44352]: time="2021-05-13T16:54:26.574800326+08:00" level=warning msg="Your kernel does not support cgroup blkio weight_device"

kubelet errors:

kubelet

Solution

To fix the cgroup problem:

1. curl https://pi-ops.oss-cn-hangzhou.aliyuncs.com/scripts/cgroupfs-mount.sh | bash

2. Reboot the machine.

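10.2 - Pod Eviction

Problem description

Pods on a node were evicted.

Cause

1. Check the status of the node and of the pods on it

The node is Ready. Listing all pods on the node shows evicted pods and an nvidia-device-plugin pod stuck in Pending.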

root@host:~$ kgpoallowide |grep 192.168.1.1
department-56   173e397c-ea35-4aac-85d8-07106e55d7b7   0/1       Evicted             0          52d       <none>            192.168.1.1   <none>
kube-system     nvidia-device-plugin-daemonset-d58d2   0/1       Pending             0          1s        <none>            192.168.1.1   <none>

2. Check the kubelet log on that node

0905 15:42:13.182280   23506 eviction_manager.go:142] Failed to admit pod rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432) - node has conditions: [DiskPressure]
I0905 15:42:14.827343   23506 kubelet.go:1836] SyncLoop (ADD, "api"): "nvidia-device-plugin-daemonset-88sm6_kube-system(adbd9227-cfb0-11e9-9729-6c92bf5e2432)"
W0905 15:42:14.827372   23506 eviction_manager.go:142] Failed to admit pod nvidia-device-plugin-daemonset-88sm6_kube-system(adbd9227-cfb0-11e9-9729-6c92bf5e2432) - node has conditions: [DiskPressure]
I0905 15:42:15.722378   23506 kubelet_node_status.go:607] Update capacity for nvidia.com/gpu-share to 0
I0905 15:42:16.692488   23506 kubelet.go:1852] SyncLoop (DELETE, "api"): "rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432)"
W0905 15:42:16.698445   23506 status_manager.go:489] Failed to delete status for pod "rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432)": pod "rdma-device-plugin-daemonset-8nwb8" not found
I0905 15:42:16.698490   23506 kubelet.go:1846] SyncLoop (REMOVE, "api"): "rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432)"
I0905 15:42:16.699267   23506 kubelet.go:2040] Failed to delete pod "rdma-device-plugin-daemonset-8nwb8_kube-system(acc28a85-cfb0-11e9-9729-6c92bf5e2432)", err: pod not found
W0905 15:42:16.777355   23506 eviction_manager.go:332] eviction manager: attempting to reclaim nodefs
I0905 15:42:16.777384   23506 eviction_manager.go:346] eviction manager: must evict pod(s) to reclaim nodefs
E0905 15:42:16.777390   23506 eviction_manager.go:357] eviction manager: eviction thresholds have been met, but no pods are active to evict

The log contains pod eviction entries; the reason given is node has conditions: [DiskPressure].

3. Check disk usage

[root@host /]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        126G     0  126G   0% /dev
tmpfs           126G     0  126G   0% /dev/shm
tmpfs           126G   27M  126G   1% /run
tmpfs           126G     0  126G   0% /sys/fs/cgroup
/dev/sda1        20G   19G     0 100% /   # the root filesystem is full
/dev/nvme1n1    3.0T  191G  2.8T   7% /data2
/dev/nvme0n1    3.0T  1.3T  1.7T  44% /data1
/dev/sda4       182G   95G   87G  53% /data
/dev/sda3        20G  3.8G   15G  20% /usr/local
tmpfs            26G     0   26G   0% /run/user/0

The root filesystem is full. Next, find out which files are using the space.

[root@host ~/kata]# du -sh ./*
1.0M	./log
944K	./netlink
6.6G	./kernel3

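About 7G of logs were found under /var/log/. After cleaning up those logs and other unneeded files, the root filesystem has free space again.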

[root@host /data]# df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs        126G     0  126G   0% /dev
tmpfs           126G     0  126G   0% /dev/shm
tmpfs           126G   27M  126G   1% /run
tmpfs           126G     0  126G   0% /sys/fs/cgroup
/dev/sda1        20G  5.8G   13G  32% /   # the root filesystem is back to normal
/dev/nvme1n1    3.0T  191G  2.8T   7% /data2

Checking the pods on the node again shows that the plugin pods have recovered.

root@host:~$ kgpoallowide |grep 192.168.1.1
kube-system     nvidia-device-plugin-daemonset-h4pjc   1/1       Running             0          16m       192.168.1.1   192.168.1.1   <none>
kube-system     rdma-device-plugin-daemonset-xlkbv     1/1       Running             0          16m       192.168.1.1   192.168.1.1   <none>

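4. Check the kubelet configuration

Inspecting the kubelet's eviction-related flags shows that eviction is enabled on this node; under normal circumstances this setting should be disabled.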

ExecStart=/usr/local/bin/kubelet \
	...
  --eviction-hard=nodefs.available<1% \

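Solution

In summary: kubelet had pod eviction enabled and the root filesystem reached 100% usage, so pods were evicted and new pods could no longer be admitted on the node.

The fix is as follows:

1. Disable the kubelet eviction mechanism.

2. Clean up files on the root filesystem to free space, and add disk monitoring for the root filesystem going forward.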
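
To make the threshold in --eviction-hard=nodefs.available<1% concrete, here is a minimal sketch of the comparison it expresses. It is an illustration only, not kubelet's implementation; the function names availablePercent and thresholdMet are hypothetical, and the sizes are taken from the df output above.

```go
package main

import "fmt"

// availablePercent returns the share of the filesystem that is still free.
func availablePercent(availBytes, totalBytes uint64) float64 {
	if totalBytes == 0 {
		return 0
	}
	return float64(availBytes) / float64(totalBytes) * 100
}

// thresholdMet mimics a hard eviction signal such as nodefs.available<1%:
// it is true when the free share drops below minPercent.
// (Hypothetical helper; kubelet's real logic lives in its eviction manager.)
func thresholdMet(availBytes, totalBytes uint64, minPercent float64) bool {
	return availablePercent(availBytes, totalBytes) < minPercent
}

func main() {
	const gib = 1 << 30
	// Root filesystem before cleanup: 20G total, 0 available.
	fmt.Println(thresholdMet(0, 20*gib, 1)) // true
	// Root filesystem after cleanup: 20G total, 13G available.
	fmt.Println(thresholdMet(13*gib, 20*gib, 1)) // false
}
```

The same comparison is a reasonable basis for the root filesystem monitoring suggested above, with an alert threshold well above the eviction threshold.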

10.3 - Image Pull Failures

Troubleshooting common image pull problems

1. Pod status is ErrImagePull or ImagePullBackOff

docker-hub-75d4dfb984-5hggg           0/1     ImagePullBackOff   0          14m     192.168.1.30   <node ip>   
docker-hub-75d4dfb984-9r57b           0/1     ErrImagePull       0          53s     192.168.0.42   <node ip>   
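  • ErrImagePull: the pod has been scheduled to a node, but kubelet's call to docker to pull the image failed.
  • ImagePullBackOff: after the pull failed, kubelet keeps retrying with a back-off delay and the pull still fails.

2. Check the pod events

Use kubectl describe pod to view the pod events. The same error messages also appear in the kubelet or docker logs.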

2.1. http: server gave HTTP response to HTTPS client

If you see the following error, try adding the registry to docker's insecure registries configuration.

Error getting v2 registry: Get https://docker.com:8080/v2/: http: server gave HTTP response to HTTPS client"

Specifically, edit the insecure-registries parameter in /etc/docker/daemon.json, then restart docker so the change takes effect.

#cat /etc/docker/daemon.json
{
	...
  "insecure-registries": [
	...
    "docker.com:8080"
  ],
  ...
}

2.2. no basic auth credentials

A no basic auth credentials error means that authentication against the image registry failed when kubelet called the docker API to pull the image.

  Normal   BackOff    18s               kubelet, 192.168.1.1  Back-off pulling image "docker.com:8080/public/2048:latest"
  Warning  Failed     18s               kubelet, 192.168.1.1  Error: ImagePullBackOff
  Normal   Pulling    5s (x2 over 18s)  kubelet, 192.168.1.1  Pulling image "docker.com:8080/public/2048:latest"
  Warning  Failed     5s (x2 over 18s)  kubelet, 192.168.1.1  Failed to pull image "docker.com:8080/public/2048:latest": rpc error: code = Unknown desc = Error response from daemon: Get http://docker.com:8080/v2/public/2048/manifests/latest: no basic auth credentials
  Warning  Failed     5s (x2 over 18s)  kubelet, 192.168.1.1  Error: ErrImagePull

To fix this, log in to the registry on the node where the pull fails; the credentials are written to $HOME/.docker/config.json. Then copy that file to /var/lib/kubelet/config.json.

10.4 - PVC Terminating

Problem description

pvc terminating

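A PVC gets stuck in Terminating while it is being deleted. This usually means its finalizers have not been cleared, for example because a pod is still using the PVC.

Solution

Clearing the finalizers forces the deletion to complete. Make sure nothing still uses the PVC before running: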

kubectl patch pvc {PVC_NAME} -p '{"metadata":{"finalizers":null}}'

11 - Source Code Analysis

11.2 -

11.2.1 -

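kube-apiserver Source Code Analysis (1): NewAPIServerCommand

The following analysis is based on kubernetes v1.12.0.

This article analyzes the cmd part of kube-apiserver, that is, the code around NewAPIServerCommand. The more detailed logic is left to later articles.

The directory structure of the cmd part of kube-apiserver is as follows: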

kube-apiserver
├── apiserver.go   # main entry point of kube-apiserver
└── app
    ├── aggregator.go
    ├── apiextensions.go
    ├── options  # initializes the options used by kube-apiserver
    │   ├── options.go     # includes NewServerRunOptions, Flags, etc.
    │   ├── options_test.go
    │   └── validation.go
    ├── server.go   # includes NewAPIServerCommand, Run, CreateServerChain, Complete, etc.

1. Main

This code is located in cmd/kube-apiserver/apiserver.go

func main() {
	rand.Seed(time.Now().UTC().UnixNano())

	command := app.NewAPIServerCommand(server.SetupSignalHandler())

	// TODO: once we switch everything over to Cobra commands, we can go back to calling
	// utilflag.InitFlags() (by removing its pflag.Parse() call). For now, we have to set the
	// normalize func and add the go flag set by hand.
	pflag.CommandLine.SetNormalizeFunc(utilflag.WordSepNormalizeFunc)
	pflag.CommandLine.AddGoFlagSet(goflag.CommandLine)
	// utilflag.InitFlags()
	logs.InitLogs()
	defer logs.FlushLogs()

	if err := command.Execute(); err != nil {
		fmt.Fprintf(os.Stderr, "error: %v\n", err)
		os.Exit(1)
	}
}

Key code:

// Initialize the APIServerCommand
command := app.NewAPIServerCommand(server.SetupSignalHandler())
// Call Execute
err := command.Execute()

2. NewAPIServerCommand

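This code is located in cmd/kube-apiserver/app/server.go

NewAPIServerCommand is the constructor for the Cobra command-line framework. It consists of three parts:

  • Construct the options
  • Add the flags
  • Execute the Run function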

The complete code is as follows:

// NewAPIServerCommand creates a *cobra.Command object with default parameters
func NewAPIServerCommand(stopCh <-chan struct{}) *cobra.Command {
	s := options.NewServerRunOptions()
	cmd := &cobra.Command{
		Use: "kube-apiserver",
		Long: `The Kubernetes API server validates and configures data
for the api objects which include pods, services, replicationcontrollers, and
others. The API Server services REST operations and provides the frontend to the
cluster's shared state through which all other components interact.`,
		RunE: func(cmd *cobra.Command, args []string) error {
			verflag.PrintAndExitIfRequested()
			utilflag.PrintFlags(cmd.Flags())

			// set default options
			completedOptions, err := Complete(s)
			if err != nil {
				return err
			}

			// validate options
			if errs := completedOptions.Validate(); len(errs) != 0 {
				return utilerrors.NewAggregate(errs)
			}

			return Run(completedOptions, stopCh)
		},
	}

	fs := cmd.Flags()
	namedFlagSets := s.Flags()
	for _, f := range namedFlagSets.FlagSets {
		fs.AddFlagSet(f)
	}

	usageFmt := "Usage:\n  %s\n"
	cols, _, _ := apiserverflag.TerminalSize(cmd.OutOrStdout())
	cmd.SetUsageFunc(func(cmd *cobra.Command) error {
		fmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine())
		apiserverflag.PrintSections(cmd.OutOrStderr(), namedFlagSets, cols)
		return nil
	})
	cmd.SetHelpFunc(func(cmd *cobra.Command, args []string) {
		fmt.Fprintf(cmd.OutOrStdout(), "%s\n\n"+usageFmt, cmd.Long, cmd.UseLine())
		apiserverflag.PrintSections(cmd.OutOrStdout(), namedFlagSets, cols)
	})

	return cmd
}

Key code:

// Construct the options
s := options.NewServerRunOptions()
// Add the flags
fs := cmd.Flags()
namedFlagSets := s.Flags()
for _, f := range namedFlagSets.FlagSets {
	fs.AddFlagSet(f)
}
// set default options
completedOptions, err := Complete(s)
// Run
Run(completedOptions, stopCh)
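
The same options, flags, complete, validate, run sequence can be sketched with only the Go standard library. This is an illustration of the pattern, not the kube-apiserver code: it uses the flag package instead of Cobra, and serverOptions, its fields, and runCommand are hypothetical names invented for the example.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// serverOptions is a hypothetical, much smaller stand-in for ServerRunOptions.
type serverOptions struct {
	Port         int
	ExternalHost string
}

// newServerOptions builds the options with default values.
func newServerOptions() *serverOptions {
	return &serverOptions{Port: 6443}
}

// addFlags binds each option field to a command-line flag.
func (o *serverOptions) addFlags(fs *flag.FlagSet) {
	fs.IntVar(&o.Port, "port", o.Port, "port to serve on")
	fs.StringVar(&o.ExternalHost, "external-host", o.ExternalHost, "externally visible host name")
}

// complete fills in defaults that depend on the parsed flag values.
func (o *serverOptions) complete() error {
	if o.ExternalHost == "" {
		o.ExternalHost = "localhost"
	}
	return nil
}

// validate rejects option values that cannot work.
func (o *serverOptions) validate() error {
	if o.Port <= 0 || o.Port > 65535 {
		return errors.New("port must be between 1 and 65535")
	}
	return nil
}

// runCommand wires the steps together in the same order as the code above:
// construct options, add flags, parse, complete, validate, run.
func runCommand(args []string, run func(*serverOptions) error) error {
	opts := newServerOptions()
	fs := flag.NewFlagSet("demo-server", flag.ContinueOnError)
	opts.addFlags(fs)
	if err := fs.Parse(args); err != nil {
		return err
	}
	if err := opts.complete(); err != nil {
		return err
	}
	if err := opts.validate(); err != nil {
		return err
	}
	return run(opts)
}

func main() {
	err := runCommand([]string{"--port", "8443"}, func(o *serverOptions) error {
		fmt.Printf("would serve on %s:%d\n", o.ExternalHost, o.Port)
		return nil
	})
	if err != nil {
		fmt.Println("error:", err)
	}
}
```

Keeping defaulting (complete) and checking (validate) separate from the run step means the run function only ever sees options that are fully populated and known to be valid.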

3. NewServerRunOptions

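NewServerRunOptions builds the ServerRunOptions struct from default parameters. ServerRunOptions holds the configuration the apiserver runs with. The struct is defined as follows.

3.1. ServerRunOptions

The main configuration items are: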

  • GenericServerRunOptions
  • Etcd
  • SecureServing
  • KubeletConfig
  • ...
// ServerRunOptions runs a kubernetes api server.
type ServerRunOptions struct {
	GenericServerRunOptions *genericoptions.ServerRunOptions
	Etcd                    *genericoptions.EtcdOptions
	SecureServing           *genericoptions.SecureServingOptionsWithLoopback
	InsecureServing         *genericoptions.DeprecatedInsecureServingOptionsWithLoopback
	Audit                   *genericoptions.AuditOptions
	Features                *genericoptions.FeatureOptions
	Admission               *kubeoptions.AdmissionOptions
	Authentication          *kubeoptions.BuiltInAuthenticationOptions
	Authorization           *kubeoptions.BuiltInAuthorizationOptions
	CloudProvider           *kubeoptions.CloudProviderOptions
	StorageSerialization    *kubeoptions.StorageSerializationOptions
	APIEnablement           *genericoptions.APIEnablementOptions

	AllowPrivileged           bool
	EnableLogsHandler         bool
	EventTTL                  time.Duration
	KubeletConfig             kubeletclient.KubeletClientConfig
	KubernetesServiceNodePort int
	MaxConnectionBytesPerSec  int64
	ServiceClusterIPRange     net.IPNet // TODO: make this a list
	ServiceNodePortRange      utilnet.PortRange
	SSHKeyfile                string
	SSHUser                   string

	ProxyClientCertFile string
	ProxyClientKeyFile  string

	EnableAggregatorRouting bool

	MasterCount            int
	EndpointReconcilerType string

	ServiceAccountSigningKeyFile string
}

3.2. NewServerRunOptions

NewServerRunOptions initializes the configuration struct.

// NewServerRunOptions creates a new ServerRunOptions object with default parameters
func NewServerRunOptions() *ServerRunOptions {
	s := ServerRunOptions{
		GenericServerRunOptions: genericoptions.NewServerRunOptions(),
		Etcd:                 genericoptions.NewEtcdOptions(storagebackend.NewDefaultConfig(kubeoptions.DefaultEtcdPathPrefix, nil)),
		SecureServing:        kubeoptions.NewSecureServingOptions(),
		InsecureServing:      kubeoptions.NewInsecureServingOptions(),
		Audit:                genericoptions.NewAuditOptions(),
		Features:             genericoptions.NewFeatureOptions(),
		Admission:            kubeoptions.NewAdmissionOptions(),
		Authentication:       kubeoptions.NewBuiltInAuthenticationOptions().WithAll(),
		Authorization:        kubeoptions.NewBuiltInAuthorizationOptions(),
		CloudProvider:        kubeoptions.NewCloudProviderOptions(),
		StorageSerialization: kubeoptions.NewStorageSerializationOptions(),
		APIEnablement:        genericoptions.NewAPIEnablementOptions(),

		EnableLogsHandler:      true,
		EventTTL:               1 * time.Hour,
		MasterCount:            1,
		EndpointReconcilerType: string(reconcilers.LeaseEndpointReconcilerType),
		KubeletConfig: kubeletclient.KubeletClientConfig{
			Port:         ports.KubeletPort,
			ReadOnlyPort: ports.KubeletReadOnlyPort,
			PreferredAddressTypes: []string{
				// --override-hostname
				string(api.NodeHostName),

				// internal, preferring DNS if reported
				string(api.NodeInternalDNS),
				string(api.NodeInternalIP),

				// external, preferring DNS if reported
				string(api.NodeExternalDNS),
				string(api.NodeExternalIP),
			},
			EnableHttps: true,
			HTTPTimeout: time.Duration(5) * time.Second,
		},
		ServiceNodePortRange: kubeoptions.DefaultServiceNodePortRange,
	}
	s.ServiceClusterIPRange = kubeoptions.DefaultServiceIPCIDR

	// Overwrite the default for storage data format.
	s.Etcd.DefaultStorageMediaType = "application/vnd.kubernetes.protobuf"

	return &s
}

3.3. Complete

After the kube-apiserver flags have been parsed, Complete is called to fill in the default configuration.

This code is located in cmd/kube-apiserver/app/server.go

// Should be called after kube-apiserver flags parsed.
func Complete(s *options.ServerRunOptions) (completedServerRunOptions, error) {
	var options completedServerRunOptions
	// set defaults
	if err := s.GenericServerRunOptions.DefaultAdvertiseAddress(s.SecureServing.SecureServingOptions); err != nil {
		return options, err
	}
	if err := kubeoptions.DefaultAdvertiseAddress(s.GenericServerRunOptions, s.InsecureServing.DeprecatedInsecureServingOptions); err != nil {
		return options, err
	}
	serviceIPRange, apiServerServiceIP, err := master.DefaultServiceIPRange(s.ServiceClusterIPRange)
	if err != nil {
		return options, fmt.Errorf("error determining service IP ranges: %v", err)
	}
	s.ServiceClusterIPRange = serviceIPRange
	if err := s.SecureServing.MaybeDefaultWithSelfSignedCerts(s.GenericServerRunOptions.AdvertiseAddress.String(), []string{"kubernetes.default.svc", "kubernetes.default", "kubernetes"}, []net.IP{apiServerServiceIP}); err != nil {
		return options, fmt.Errorf("error creating self-signed certificates: %v", err)
	}

	if len(s.GenericServerRunOptions.ExternalHost) == 0 {
		if len(s.GenericServerRunOptions.AdvertiseAddress) > 0 {
			s.GenericServerRunOptions.ExternalHost = s.GenericServerRunOptions.AdvertiseAddress.String()
		} else {
			if hostname, err := os.Hostname(); err == nil {
				s.GenericServerRunOptions.ExternalHost = hostname
			} else {
				return options, fmt.Errorf("error finding host name: %v", err)
			}
		}
		glog.Infof("external host was not specified, using %v", s.GenericServerRunOptions.ExternalHost)
	}

	s.Authentication.ApplyAuthorization(s.Authorization)

	// Use (ServiceAccountSigningKeyFile != "") as a proxy to the user enabling
	// TokenRequest functionality. This defaulting was convenient, but messed up
	// a lot of people when they rotated their serving cert with no idea it was
	// connected to their service account keys. We are taking this oppurtunity to
	// remove this problematic defaulting.
	if s.ServiceAccountSigningKeyFile == "" {
		// Default to the private server key for service account token signing
		if len(s.Authentication.ServiceAccounts.KeyFiles) == 0 && s.SecureServing.ServerCert.CertKey.KeyFile != "" {
			if kubeauthenticator.IsValidServiceAccountKeyFile(s.SecureServing.ServerCert.CertKey.KeyFile) {
				s.Authentication.ServiceAccounts.KeyFiles = []string{s.SecureServing.ServerCert.CertKey.KeyFile}
			} else {
				glog.Warning("No TLS key provided, service account token authentication disabled")
			}
		}
	}

	if s.Etcd.StorageConfig.DeserializationCacheSize == 0 {
		// When size of cache is not explicitly set, estimate its size based on
		// target memory usage.
		glog.V(2).Infof("Initializing deserialization cache size based on %dMB limit", s.GenericServerRunOptions.TargetRAMMB)

		// This is the heuristics that from memory capacity is trying to infer
		// the maximum number of nodes in the cluster and set cache sizes based
		// on that value.
		// From our documentation, we officially recommend 120GB machines for
		// 2000 nodes, and we scale from that point. Thus we assume ~60MB of
		// capacity per node.
		// TODO: We may consider deciding that some percentage of memory will
		// be used for the deserialization cache and divide it by the max object
		// size to compute its size. We may even go further and measure
		// collective sizes of the objects in the cache.
		clusterSize := s.GenericServerRunOptions.TargetRAMMB / 60
		s.Etcd.StorageConfig.DeserializationCacheSize = 25 * clusterSize
		if s.Etcd.StorageConfig.DeserializationCacheSize < 1000 {
			s.Etcd.StorageConfig.DeserializationCacheSize = 1000
		}
	}
	if s.Etcd.EnableWatchCache {
		glog.V(2).Infof("Initializing cache sizes based on %dMB limit", s.GenericServerRunOptions.TargetRAMMB)
		sizes := cachesize.NewHeuristicWatchCacheSizes(s.GenericServerRunOptions.TargetRAMMB)
		if userSpecified, err := serveroptions.ParseWatchCacheSizes(s.Etcd.WatchCacheSizes); err == nil {
			for resource, size := range userSpecified {
				sizes[resource] = size
			}
		}
		s.Etcd.WatchCacheSizes, err = serveroptions.WriteWatchCacheSizes(sizes)
		if err != nil {
			return options, err
		}
	}

	// TODO: remove when we stop supporting the legacy group version.
	if s.APIEnablement.RuntimeConfig != nil {
		for key, value := range s.APIEnablement.RuntimeConfig {
			if key == "v1" || strings.HasPrefix(key, "v1/") ||
				key == "api/v1" || strings.HasPrefix(key, "api/v1/") {
				delete(s.APIEnablement.RuntimeConfig, key)
				s.APIEnablement.RuntimeConfig["/v1"] = value
			}
			if key == "api/legacy" {
				delete(s.APIEnablement.RuntimeConfig, key)
			}
		}
	}
	options.ServerRunOptions = s
	return options, nil
}
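
The deserialization cache sizing inside Complete is easy to follow as a standalone calculation. The sketch below copies the arithmetic from the code above (roughly 60MB per node, 25 cache entries per node, a floor of 1000); the function name deserializationCacheSize is hypothetical.

```go
package main

import "fmt"

// deserializationCacheSize mirrors the heuristic used in Complete when
// DeserializationCacheSize is not set explicitly.
// (Hypothetical helper name; the arithmetic is taken from the code above.)
func deserializationCacheSize(targetRAMMB int) int {
	clusterSize := targetRAMMB / 60 // assume ~60MB of capacity per node
	size := 25 * clusterSize
	if size < 1000 {
		size = 1000 // never go below the minimum cache size
	}
	return size
}

func main() {
	for _, mb := range []int{0, 1024, 120000} {
		fmt.Printf("TargetRAMMB=%d -> cache size %d\n", mb, deserializationCacheSize(mb))
	}
	// Prints:
	// TargetRAMMB=0 -> cache size 1000
	// TargetRAMMB=1024 -> cache size 1000
	// TargetRAMMB=120000 -> cache size 50000
}
```

With TargetRAMMB left at 0 the floor of 1000 always applies, so the heuristic only has an effect when the target memory is set to a large enough value.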

3. AddFlagSet

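AddFlagSet registers the flags so that values passed in from outside are written into the option struct during parsing and are finally used by the apiserver.

The AddFlagSet-related code in NewAPIServerCommand is as follows: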

fs := cmd.Flags()
namedFlagSets := s.Flags()
for _, f := range namedFlagSets.FlagSets {
	fs.AddFlagSet(f)
}

3.1. Flags

The complete code of Flags is as follows:

This code is located in cmd/kube-apiserver/app/options/options.go

// Flags returns flags for a specific APIServer by section name
func (s *ServerRunOptions) Flags() (fss apiserverflag.NamedFlagSets) {
	// Add the generic flags.
	s.GenericServerRunOptions.AddUniversalFlags(fss.FlagSet("generic"))
	s.Etcd.AddFlags(fss.FlagSet("etcd"))
	s.SecureServing.AddFlags(fss.FlagSet("secure serving"))
	s.InsecureServing.AddFlags(fss.FlagSet("insecure serving"))
	s.InsecureServing.AddUnqualifiedFlags(fss.FlagSet("insecure serving")) // TODO: remove it until kops stops using `--address`
	s.Audit.AddFlags(fss.FlagSet("auditing"))
	s.Features.AddFlags(fss.FlagSet("features"))
	s.Authentication.AddFlags(fss.FlagSet("authentication"))
	s.Authorization.AddFlags(fss.FlagSet("authorization"))
	s.CloudProvider.AddFlags(fss.FlagSet("cloud provider"))
	s.StorageSerialization.AddFlags(fss.FlagSet("storage"))
	s.APIEnablement.AddFlags(fss.FlagSet("api enablement"))
	s.Admission.AddFlags(fss.FlagSet("admission"))

	// Note: the weird ""+ in below lines seems to be the only way to get gofmt to
	// arrange these text blocks sensibly. Grrr.
	fs := fss.FlagSet("misc")
	fs.DurationVar(&s.EventTTL, "event-ttl", s.EventTTL,
		"Amount of time to retain events.")

	fs.BoolVar(&s.AllowPrivileged, "allow-privileged", s.AllowPrivileged,
		"If true, allow privileged containers. [default=false]")

	fs.BoolVar(&s.EnableLogsHandler, "enable-logs-handler", s.EnableLogsHandler,
		"If true, install a /logs handler for the apiserver logs.")

	// Deprecated in release 1.9
	fs.StringVar(&s.SSHUser, "ssh-user", s.SSHUser,
		"If non-empty, use secure SSH proxy to the nodes, using this user name")
	fs.MarkDeprecated("ssh-user", "This flag will be removed in a future version.")

	// Deprecated in release 1.9
	fs.StringVar(&s.SSHKeyfile, "ssh-keyfile", s.SSHKeyfile,
		"If non-empty, use secure SSH proxy to the nodes, using this user keyfile")
	fs.MarkDeprecated("ssh-keyfile", "This flag will be removed in a future version.")

	fs.Int64Var(&s.MaxConnectionBytesPerSec, "max-connection-bytes-per-sec", s.MaxConnectionBytesPerSec, ""+
		"If non-zero, throttle each user connection to this number of bytes/sec. "+
		"Currently only applies to long-running requests.")

	fs.IntVar(&s.MasterCount, "apiserver-count", s.MasterCount,
		"The number of apiservers running in the cluster, must be a positive number. (In use when --endpoint-reconciler-type=master-count is enabled.)")

	fs.StringVar(&s.EndpointReconcilerType, "endpoint-reconciler-type", string(s.EndpointReconcilerType),
		"Use an endpoint reconciler ("+strings.Join(reconcilers.AllTypes.Names(), ", ")+")")

	// See #14282 for details on how to test/try this option out.
	// TODO: remove this comment once this option is tested in CI.
	fs.IntVar(&s.KubernetesServiceNodePort, "kubernetes-service-node-port", s.KubernetesServiceNodePort, ""+
		"If non-zero, the Kubernetes master service (which apiserver creates/maintains) will be "+
		"of type NodePort, using this as the value of the port. If zero, the Kubernetes master "+
		"service will be of type ClusterIP.")

	fs.IPNetVar(&s.ServiceClusterIPRange, "service-cluster-ip-range", s.ServiceClusterIPRange, ""+
		"A CIDR notation IP range from which to assign service cluster IPs. This must not "+
		"overlap with any IP ranges assigned to nodes for pods.")

	fs.Var(&s.ServiceNodePortRange, "service-node-port-range", ""+
		"A port range to reserve for services with NodePort visibility. "+
		"Example: '30000-32767'. Inclusive at both ends of the range.")

	// Kubelet related flags:
	fs.BoolVar(&s.KubeletConfig.EnableHttps, "kubelet-https", s.KubeletConfig.EnableHttps,
		"Use https for kubelet connections.")

	fs.StringSliceVar(&s.KubeletConfig.PreferredAddressTypes, "kubelet-preferred-address-types", s.KubeletConfig.PreferredAddressTypes,
		"List of the preferred NodeAddressTypes to use for kubelet connections.")

	fs.UintVar(&s.KubeletConfig.Port, "kubelet-port", s.KubeletConfig.Port,
		"DEPRECATED: kubelet port.")
	fs.MarkDeprecated("kubelet-port", "kubelet-port is deprecated and will be removed.")

	fs.UintVar(&s.KubeletConfig.ReadOnlyPort, "kubelet-read-only-port", s.KubeletConfig.ReadOnlyPort,
		"DEPRECATED: kubelet port.")

	fs.DurationVar(&s.KubeletConfig.HTTPTimeout, "kubelet-timeout", s.KubeletConfig.HTTPTimeout,
		"Timeout for kubelet operations.")

	fs.StringVar(&s.KubeletConfig.CertFile, "kubelet-client-certificate", s.KubeletConfig.CertFile,
		"Path to a client cert file for TLS.")

	fs.StringVar(&s.KubeletConfig.KeyFile, "kubelet-client-key", s.KubeletConfig.KeyFile,
		"Path to a client key file for TLS.")

	fs.StringVar(&s.KubeletConfig.CAFile, "kubelet-certificate-authority", s.KubeletConfig.CAFile,
		"Path to a cert file for the certificate authority.")

	// TODO: delete this flag in 1.13
	repair := false
	fs.BoolVar(&repair, "repair-malformed-updates", false, "deprecated")
	fs.MarkDeprecated("repair-malformed-updates", "This flag will be removed in a future version")

	fs.StringVar(&s.ProxyClientCertFile, "proxy-client-cert-file", s.ProxyClientCertFile, ""+
		"Client certificate used to prove the identity of the aggregator or kube-apiserver "+
		"when it must call out during a request. This includes proxying requests to a user "+
		"api-server and calling out to webhook admission plugins. It is expected that this "+
		"cert includes a signature from the CA in the --requestheader-client-ca-file flag. "+
		"That CA is published in the 'extension-apiserver-authentication' configmap in "+
		"the kube-system namespace. Components receiving calls from kube-aggregator should "+
		"use that CA to perform their half of the mutual TLS verification.")
	fs.StringVar(&s.ProxyClientKeyFile, "proxy-client-key-file", s.ProxyClientKeyFile, ""+
		"Private key for the client certificate used to prove the identity of the aggregator or kube-apiserver "+
		"when it must call out during a request. This includes proxying requests to a user "+
		"api-server and calling out to webhook admission plugins.")

	fs.BoolVar(&s.EnableAggregatorRouting, "enable-aggregator-routing", s.EnableAggregatorRouting,
		"Turns on aggregator routing requests to endpoints IP rather than cluster IP.")

	fs.StringVar(&s.ServiceAccountSigningKeyFile, "service-account-signing-key-file", s.ServiceAccountSigningKeyFile, ""+
		"Path to the file that contains the current private key of the service account token issuer. The issuer will sign issued ID tokens with this private key. (Requires the 'TokenRequest' feature gate.)")

	return fss
}

4. Run

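Run runs the apiserver as a long-lived process that is not expected to exit.

It does the following:

  1. Construct an aggregated server struct.
  2. Execute PrepareRun.
  3. Finally execute Run.

This code is located in cmd/kube-apiserver/app/server.go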

// Run runs the specified APIServer.  This should never exit.
func Run(completeOptions completedServerRunOptions, stopCh <-chan struct{}) error {
	// To help debugging, immediately log version
	glog.Infof("Version: %+v", version.Get())

	server, err := CreateServerChain(completeOptions, stopCh)
	if err != nil {
		return err
	}

	return server.PrepareRun().Run(stopCh)
}

4.1. CreateServerChain

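CreateServerChain constructs the aggregated server.

The basic flow is as follows:

  1. First generate the config objects, including kubeAPIServerConfig and apiExtensionsConfig.
  2. Then use the configs to generate the server objects, including apiExtensionsServer and kubeAPIServer.
  3. Execute PrepareRun on apiExtensionsServer and kubeAPIServer.
  4. Generate the aggregated config object aggregatorConfig.
  5. Generate the aggregated server aggregatorServer from aggregatorConfig, kubeAPIServer, and apiExtensionsServer.

This code is located in cmd/kube-apiserver/app/server.go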

// CreateServerChain creates the apiservers connected via delegation.
func CreateServerChain(completedOptions completedServerRunOptions, stopCh <-chan struct{}) (*genericapiserver.GenericAPIServer, error) {
	nodeTunneler, proxyTransport, err := CreateNodeDialer(completedOptions)
	if err != nil {
		return nil, err
	}

	kubeAPIServerConfig, insecureServingInfo, serviceResolver, pluginInitializer, admissionPostStartHook, err := CreateKubeAPIServerConfig(completedOptions, nodeTunneler, proxyTransport)
	if err != nil {
		return nil, err
	}

	// If additional API servers are added, they should be gated.
	apiExtensionsConfig, err := createAPIExtensionsConfig(*kubeAPIServerConfig.GenericConfig, kubeAPIServerConfig.ExtraConfig.VersionedInformers, pluginInitializer, completedOptions.ServerRunOptions, completedOptions.MasterCount)
	if err != nil {
		return nil, err
	}
	apiExtensionsServer, err := createAPIExtensionsServer(apiExtensionsConfig, genericapiserver.NewEmptyDelegate())
	if err != nil {
		return nil, err
	}

	kubeAPIServer, err := CreateKubeAPIServer(kubeAPIServerConfig, apiExtensionsServer.GenericAPIServer, admissionPostStartHook)
	if err != nil {
		return nil, err
	}

	// otherwise go down the normal path of standing the aggregator up in front of the API server
	// this wires up openapi
	kubeAPIServer.GenericAPIServer.PrepareRun()

	// This will wire up openapi for extension api server
	apiExtensionsServer.GenericAPIServer.PrepareRun()

	// aggregator comes last in the chain
	aggregatorConfig, err := createAggregatorConfig(*kubeAPIServerConfig.GenericConfig, completedOptions.ServerRunOptions, kubeAPIServerConfig.ExtraConfig.VersionedInformers, serviceResolver, proxyTransport, pluginInitializer)
	if err != nil {
		return nil, err
	}
	aggregatorServer, err := createAggregatorServer(aggregatorConfig, kubeAPIServer.GenericAPIServer, apiExtensionsServer.Informers)
	if err != nil {
		// we don't need special handling for innerStopCh because the aggregator server doesn't create any go routines
		return nil, err
	}

	if insecureServingInfo != nil {
		insecureHandlerChain := kubeserver.BuildInsecureHandlerChain(aggregatorServer.GenericAPIServer.UnprotectedHandler(), kubeAPIServerConfig.GenericConfig)
		if err := insecureServingInfo.Serve(insecureHandlerChain, kubeAPIServerConfig.GenericConfig.RequestTimeout, stopCh); err != nil {
			return nil, err
		}
	}

	return aggregatorServer.GenericAPIServer, nil
}
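
The delegation order built by CreateServerChain (aggregator, then kube-apiserver, then the API extensions server, then an empty delegate) can be pictured as a chain of HTTP handlers. The following is a simplified sketch using net/http only: routing by path prefix is an assumption made for the example, and newDelegatingHandler, buildChain, whoServes, and the example paths are hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// newDelegatingHandler serves paths under prefix itself and hands every
// other request to the delegate, the next server in the chain.
func newDelegatingHandler(name, prefix string, delegate http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if strings.HasPrefix(r.URL.Path, prefix) {
			fmt.Fprint(w, name)
			return
		}
		delegate.ServeHTTP(w, r)
	})
}

// buildChain assembles the handlers in the same order as CreateServerChain:
// the last server in the chain is created first and passed in as delegate.
func buildChain() http.Handler {
	notFound := http.NotFoundHandler() // plays the role of the empty delegate
	extensions := newDelegatingHandler("apiextensions", "/apis/apiextensions.k8s.io", notFound)
	kube := newDelegatingHandler("kube-apiserver", "/api/", extensions)
	return newDelegatingHandler("aggregator", "/apis/metrics.example.com", kube)
}

// whoServes sends a GET request through the chain and reports which server
// answered, or the HTTP status if none did.
func whoServes(chain http.Handler, path string) string {
	rec := httptest.NewRecorder()
	chain.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, path, nil))
	if rec.Code != http.StatusOK {
		return fmt.Sprintf("status %d", rec.Code)
	}
	return rec.Body.String()
}

func main() {
	chain := buildChain()
	paths := []string{
		"/apis/metrics.example.com/v1",
		"/api/v1/pods",
		"/apis/apiextensions.k8s.io/v1",
		"/unknown",
	}
	for _, p := range paths {
		fmt.Printf("%s -> %s\n", p, whoServes(chain, p))
	}
}
```

Only the outermost handler is returned to the caller, which matches CreateServerChain returning aggregatorServer.GenericAPIServer.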

4.2. PrepareRun

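PrepareRun performs the setup steps that follow API installation: it installs the Swagger and OpenAPI routes and the healthz endpoint, and registers the audit backend pre-shutdown hook.

This code is located in vendor/k8s.io/apiserver/pkg/server/genericapiserver.go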

// PrepareRun does post API installation setup steps.
func (s *GenericAPIServer) PrepareRun() preparedGenericAPIServer {
	if s.swaggerConfig != nil {
		routes.Swagger{Config: s.swaggerConfig}.Install(s.Handler.GoRestfulContainer)
	}
	if s.openAPIConfig != nil {
		routes.OpenAPI{
			Config: s.openAPIConfig,
		}.Install(s.Handler.GoRestfulContainer, s.Handler.NonGoRestfulMux)
	}

	s.installHealthz()

	// Register audit backend preShutdownHook.
	if s.AuditBackend != nil {
		s.AddPreShutdownHook("audit-backend", func() error {
			s.AuditBackend.Shutdown()
			return nil
		})
	}

	return preparedGenericAPIServer{s}
}

4.3. preparedGenericAPIServer.Run

preparedGenericAPIServer.Run runs a secure http server. The detailed implementation is left to a later article.

This code is located in vendor/k8s.io/apiserver/pkg/server/genericapiserver.go

// Run spawns the secure http server. It only returns if stopCh is closed
// or the secure port cannot be listened on initially.
func (s preparedGenericAPIServer) Run(stopCh <-chan struct{}) error {
	err := s.NonBlockingRun(stopCh)
	if err != nil {
		return err
	}

	<-stopCh

	err = s.RunPreShutdownHooks()
	if err != nil {
		return err
	}

	// Wait for all requests to finish, which are bounded by the RequestTimeout variable.
	s.HandlerChainWaitGroup.Wait()

	return nil
}

Key function:

err := s.NonBlockingRun(stopCh)

preparedGenericAPIServer.Run mainly calls NonBlockingRun, which ultimately runs an http server. That logic is left to a later article.
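
The shutdown sequence in preparedGenericAPIServer.Run (start without blocking, wait on stopCh, run the pre-shutdown hooks, wait for in-flight work) can be reproduced in a few lines. This is a minimal sketch with hypothetical names (demoServer, nonBlockingRun, run); it does not start a real http server.

```go
package main

import (
	"fmt"
	"sync"
)

// demoServer is a hypothetical stand-in for preparedGenericAPIServer.
type demoServer struct {
	inFlight         sync.WaitGroup // plays the role of HandlerChainWaitGroup
	preShutdownHooks []func() error
	events           []string // records the order of lifecycle steps
}

// nonBlockingRun starts background work and returns immediately.
func (s *demoServer) nonBlockingRun(stopCh <-chan struct{}) error {
	s.inFlight.Add(1)
	go func() {
		defer s.inFlight.Done()
		<-stopCh // a real server would keep serving until told to stop
	}()
	s.events = append(s.events, "started")
	return nil
}

// run follows the same order as the Run method shown above.
func (s *demoServer) run(stopCh <-chan struct{}) error {
	if err := s.nonBlockingRun(stopCh); err != nil {
		return err
	}
	<-stopCh // block until shutdown is requested
	for _, hook := range s.preShutdownHooks {
		if err := hook(); err != nil {
			return err
		}
	}
	s.inFlight.Wait() // wait for outstanding work to finish
	s.events = append(s.events, "stopped")
	return nil
}

func main() {
	s := &demoServer{}
	s.preShutdownHooks = append(s.preShutdownHooks, func() error {
		s.events = append(s.events, "pre-shutdown hook")
		return nil
	})
	stopCh := make(chan struct{})
	close(stopCh) // request shutdown right away so the example terminates
	if err := s.run(stopCh); err != nil {
		fmt.Println("error:", err)
	}
	fmt.Println(s.events) // [started pre-shutdown hook stopped]
}
```

In kube-apiserver the stop channel comes from server.SetupSignalHandler(), so the same sequence is triggered by a termination signal rather than by closing the channel in code.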

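5. Summary

NewAPIServerCommand uses the Cobra command-line framework. Using the framework involves the following parts:

  • Construct the option parameters, which the executing body (the server in this article) uses as its configuration.
  • Add the flags, which parse the flag values passed in into the struct fields of the options.
  • Execute the Run function, which is the running logic of the executing body (the core part).

The Run function does the following:

  1. Construct an aggregated server struct.
  2. Execute PrepareRun.
  3. Finally execute preparedGenericAPIServer.Run.

preparedGenericAPIServer.Run mainly calls NonBlockingRun, which ultimately runs an http server. The details of NonBlockingRun are left to a separate later article.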

11.3 -

11.3.1 -

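kube-controller-manager Source Code Analysis (2): DeploymentController

The following analysis is based on kubernetes v1.12.0.

This article uses the deployment controller as an example to analyze how this kind of controller runs. The code is mainly located in pkg/controller/deployment. The pkg/controller directory contains the concrete implementations of the various controllers.

The directory structure of the pkg part of controller manager is as follows:

controller  # mainly contains the concrete implementations of the various controllers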
├── apis
├── bootstrap
├── certificates
├── client_builder.go
├── cloud
├── clusterroleaggregation
├── controller_ref_manager.go
├── controller_utils.go  # WaitForCacheSync
├── cronjob
├── daemon
├── deployment   # deployment controller
│   ├── deployment_controller.go # NewDeploymentController、Run、syncDeployment
│   ├── progress.go   # syncRolloutStatus
│   ├── recreate.go   # rolloutRecreate
│   ├── rollback.go   # rollback
│   ├── rolling.go    # rolloutRolling
│   ├── sync.go
├── disruption  # disruption controller
├── endpoint
├── garbagecollector
├── history
├── job
├── lookup_cache.go
├── namespace   # namespace controller
├── nodeipam
├── nodelifecycle
├── podautoscaler
├── podgc
├── replicaset   # replicaset controller
├── replication  # replication controller
├── resourcequota
├── route
├── service   # service controller
├── serviceaccount
├── statefulset   # statefulset controller
└── volume  # PersistentVolumeController, AttachDetachController, PVCProtectionController

1. startDeploymentController

func startDeploymentController(ctx ControllerContext) (http.Handler, bool, error) {
	if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}] {
		return nil, false, nil
	}
	dc, err := deployment.NewDeploymentController(
		ctx.InformerFactory.Apps().V1().Deployments(),
		ctx.InformerFactory.Apps().V1().ReplicaSets(),
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.ClientBuilder.ClientOrDie("deployment-controller"),
	)
	if err != nil {
		return nil, true, fmt.Errorf("error creating Deployment controller: %v", err)
	}
	go dc.Run(int(ctx.ComponentConfig.DeploymentController.ConcurrentDeploymentSyncs), ctx.Stop)
	return nil, true, nil
}

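startDeploymentController mainly calls NewDeploymentController and the corresponding Run function. That logic lives in kubernetes/pkg/controller.

2. NewDeploymentController

NewDeploymentController mainly builds the DeploymentController struct.

It handles the following logic:

  • Build and start the event broadcaster, eventBroadcaster.
  • Initialize rsControl, clientset, and workqueue.
  • Add ResourceEventHandlerFuncs to dInformer, rsInformer, and podInformer; these are mainly the AddFunc, UpdateFunc, and DeleteFunc methods.
  • Set up the Lister and HasSynced functions of the deployment, rs, and pod Informers.
  • Assign syncHandler, which is implemented by syncDeployment.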

2.1. eventBroadcaster

The event broadcaster is used to record deployment-related events.

eventBroadcaster := record.NewBroadcaster()
eventBroadcaster.StartLogging(glog.Infof)
// TODO: remove the wrapper when every clients have moved to use the clientset.
eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: v1core.New(client.CoreV1().RESTClient()).Events("")})

2.2. rsControl

Construct the DeploymentController, including clientset, workqueue, and rsControl. rsControl is the controller that implements the ReplicaSet operations.

dc := &DeploymentController{
	client:        client,
	eventRecorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "deployment-controller"}),
	queue:         workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "deployment"),
}
dc.rsControl = controller.RealRSControl{
	KubeClient: client,
	Recorder:   dc.eventRecorder,
}

2.3. Informer().AddEventHandler

Add ResourceEventHandlerFuncs to dInformer, rsInformer, and podInformer; these are mainly the AddFunc, UpdateFunc, and DeleteFunc methods. Note that podInformer registers only DeleteFunc.

dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    dc.addDeployment,
	UpdateFunc: dc.updateDeployment,
	// This will enter the sync loop and no-op, because the deployment has been deleted from the store.
	DeleteFunc: dc.deleteDeployment,
})
rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	AddFunc:    dc.addReplicaSet,
	UpdateFunc: dc.updateReplicaSet,
	DeleteFunc: dc.deleteReplicaSet,
})
podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
	DeleteFunc: dc.deletePod,
})
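
How such handler structs behave can be shown with a small standard-library sketch: a callback that is left nil is skipped, and every registered callback only puts a "namespace/name" key on a queue. The types object, eventHandlerFuncs and keyQueue are hypothetical and only imitate the shape of cache.ResourceEventHandlerFuncs.

```go
package main

import "fmt"

// object is a hypothetical API object reduced to its identity.
type object struct{ namespace, name string }

// eventHandlerFuncs imitates the shape of cache.ResourceEventHandlerFuncs:
// any callback left nil is simply skipped.
type eventHandlerFuncs struct {
	addFunc    func(obj object)
	updateFunc func(oldObj, newObj object)
	deleteFunc func(obj object)
}

func (h eventHandlerFuncs) onAdd(obj object) {
	if h.addFunc != nil {
		h.addFunc(obj)
	}
}

func (h eventHandlerFuncs) onUpdate(oldObj, newObj object) {
	if h.updateFunc != nil {
		h.updateFunc(oldObj, newObj)
	}
}

func (h eventHandlerFuncs) onDelete(obj object) {
	if h.deleteFunc != nil {
		h.deleteFunc(obj)
	}
}

// keyQueue collects "namespace/name" keys for later processing.
type keyQueue struct{ keys []string }

func (q *keyQueue) enqueue(obj object) {
	q.keys = append(q.keys, obj.namespace+"/"+obj.name)
}

func main() {
	q := &keyQueue{}
	deployments := eventHandlerFuncs{
		addFunc:    q.enqueue,
		updateFunc: func(_, newObj object) { q.enqueue(newObj) },
		deleteFunc: q.enqueue,
	}
	pods := eventHandlerFuncs{deleteFunc: q.enqueue} // only deletions matter here

	deployments.onAdd(object{"default", "web"})
	pods.onAdd(object{"default", "web-abc"}) // no addFunc registered: ignored
	pods.onDelete(object{"default", "web-abc"})
	fmt.Println(q.keys)
}
```

Keeping the handlers this thin means the actual work happens later in the sync function, not in the informer's callback path.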

2.4. Informer.Lister()

Call the Lister() method of dInformer, rsInformer, and podInformer.

dc.dLister = dInformer.Lister()
dc.rsLister = rsInformer.Lister()
dc.podLister = podInformer.Lister()

2.5. Informer().HasSynced

Use Informer().HasSynced to determine whether the cache has finished syncing.

dc.dListerSynced = dInformer.Informer().HasSynced
dc.rsListerSynced = rsInformer.Informer().HasSynced
dc.podListerSynced = podInformer.Informer().HasSynced

2.6. syncHandler

syncHandler is set to syncDeployment; syncHandler is responsible for carrying out the deployment sync.

dc.syncHandler = dc.syncDeployment
dc.enqueueDeployment = dc.enqueue

The complete code is as follows:

// NewDeploymentController creates a new DeploymentController.
func NewDeploymentController(dInformer appsinformers.DeploymentInformer, rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, client clientset.Interface) (*DeploymentController, error) {
	eventBroadcaster := record.NewBroadcaster()
	eventBroadcaster.StartLogging(glog.Infof)
	// TODO: remove the wrapper when every clients have moved to use the clientset.
	eventBroadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: v1core.New(client.CoreV1().RESTClient()).Events("")})

	if client != nil && client.CoreV1().RESTClient().GetRateLimiter() != nil {
		if err := metrics.RegisterMetricAndTrackRateLimiterUsage("deployment_controller", client.CoreV1().RESTClient().GetRateLimiter()); err != nil {
			return nil, err
		}
	}
	dc := &DeploymentController{
		client:        client,
		eventRecorder: eventBroadcaster.NewRecorder(scheme.Scheme, v1.EventSource{Component: "deployment-controller"}),
		queue:         workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "deployment"),
	}
	dc.rsControl = controller.RealRSControl{
		KubeClient: client,
		Recorder:   dc.eventRecorder,
	}

	dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    dc.addDeployment,
		UpdateFunc: dc.updateDeployment,
		// This will enter the sync loop and no-op, because the deployment has been deleted from the store.
		DeleteFunc: dc.deleteDeployment,
	})
	rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    dc.addReplicaSet,
		UpdateFunc: dc.updateReplicaSet,
		DeleteFunc: dc.deleteReplicaSet,
	})
	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: dc.deletePod,
	})

	dc.syncHandler = dc.syncDeployment
	dc.enqueueDeployment = dc.enqueue

	dc.dLister = dInformer.Lister()
	dc.rsLister = rsInformer.Lister()
	dc.podLister = podInformer.Lister()
	dc.dListerSynced = dInformer.Informer().HasSynced
	dc.rsListerSynced = rsInformer.Informer().HasSynced
	dc.podListerSynced = podInformer.Informer().HasSynced
	return dc, nil
}

3. DeploymentController.Run

Run performs the watch and sync operations.

// Run begins watching and syncing.
func (dc *DeploymentController) Run(workers int, stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	defer dc.queue.ShutDown()

	glog.Infof("Starting deployment controller")
	defer glog.Infof("Shutting down deployment controller")

	if !controller.WaitForCacheSync("deployment", stopCh, dc.dListerSynced, dc.rsListerSynced, dc.podListerSynced) {
		return
	}

	for i := 0; i < workers; i++ {
		go wait.Until(dc.worker, time.Second, stopCh)
	}

	<-stopCh
}
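
The worker start-up in Run can be sketched as follows. The helper until is a simplified, hypothetical imitation of what wait.Until is used for here (call a function, pause, repeat until the stop channel closes), and runWorkers is an illustrative wrapper that does not exist in the controller.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// until calls f, waits for period, and repeats until stopCh is closed.
func until(f func(), period time.Duration, stopCh <-chan struct{}) {
	for {
		select {
		case <-stopCh:
			return
		default:
		}
		f()
		select {
		case <-stopCh:
			return
		case <-time.After(period):
		}
	}
}

// runWorkers starts the given number of worker goroutines and returns a
// WaitGroup that is done once all of them have exited.
func runWorkers(workers int, work func(), period time.Duration, stopCh <-chan struct{}) *sync.WaitGroup {
	wg := &sync.WaitGroup{}
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			until(work, period, stopCh)
		}()
	}
	return wg
}

func main() {
	ticks := make(chan struct{}, 16)
	work := func() {
		select {
		case ticks <- struct{}{}:
		default: // drop the tick when the buffer is full
		}
	}
	stopCh := make(chan struct{})
	wg := runWorkers(3, work, time.Millisecond, stopCh)
	for i := 0; i < 3; i++ {
		<-ticks // wait until the workers have run at least three times in total
	}
	close(stopCh)
	wg.Wait()
	fmt.Println("all workers stopped")
}
```

Closing the single stop channel is enough to end every worker, which is why Run only needs to block on it.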

3.1. WaitForCacheSync

WaitForCacheSync is used in the List-Watch mechanism to wait until the data in the local cache is consistent with the data in etcd.

// WaitForCacheSync is a wrapper around cache.WaitForCacheSync that generates log messages
// indicating that the controller identified by controllerName is waiting for syncs, followed by
// either a successful or failed sync.
func WaitForCacheSync(controllerName string, stopCh <-chan struct{}, cacheSyncs ...cache.InformerSynced) bool {
	glog.Infof("Waiting for caches to sync for %s controller", controllerName)

	if !cache.WaitForCacheSync(stopCh, cacheSyncs...) {
		utilruntime.HandleError(fmt.Errorf("Unable to sync caches for %s controller", controllerName))
		return false
	}

	glog.Infof("Caches are synced for %s controller", controllerName)
	return true
}
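
The waiting itself can be pictured as polling a set of "has synced" functions until all of them report true or the stop channel closes. The sketch below is a simplified imitation written against the standard library, not the client-go implementation; the period parameter and the informerSynced type are introduced for the example.

```go
package main

import (
	"fmt"
	"time"
)

// informerSynced reports whether one informer has finished its initial list.
type informerSynced func() bool

// waitForCacheSync polls every sync function until all return true,
// or returns false once stopCh is closed.
func waitForCacheSync(stopCh <-chan struct{}, period time.Duration, syncs ...informerSynced) bool {
	ticker := time.NewTicker(period)
	defer ticker.Stop()
	for {
		allSynced := true
		for _, synced := range syncs {
			if !synced() {
				allSynced = false
				break
			}
		}
		if allSynced {
			return true
		}
		select {
		case <-stopCh:
			return false
		case <-ticker.C:
		}
	}
}

func main() {
	calls := 0
	slowInformer := func() bool { calls++; return calls >= 3 }
	stopCh := make(chan struct{})
	ok := waitForCacheSync(stopCh, time.Millisecond, slowInformer)
	fmt.Println(ok, calls)
}
```

Returning false on stop lets the caller exit early, which matches how Run returns when the sync does not succeed.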

3.2. dc.worker

worker calls processNextWorkItem, and processNextWorkItem in turn calls syncHandler. The function assigned to syncHandler in NewDeploymentController is syncDeployment.

// worker runs a worker thread that just dequeues items, processes them, and marks them done.
// It enforces that the syncHandler is never invoked concurrently with the same key.
func (dc *DeploymentController) worker() {
	for dc.processNextWorkItem() {
	}
}

func (dc *DeploymentController) processNextWorkItem() bool {
	key, quit := dc.queue.Get()
	if quit {
		return false
	}
	defer dc.queue.Done(key)

	err := dc.syncHandler(key.(string))
	dc.handleErr(err, key)

	return true
}
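
The queue-driven loop can be sketched with a much simplified queue. miniQueue is hypothetical: it only deduplicates waiting keys and blocks in get until a key arrives or the queue shuts down, and it leaves out the Done bookkeeping and rate limiting of the real workqueue. The retry on error is a naive stand-in for what the controller delegates to handleErr.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errTransient is a sample error used to trigger one retry.
var errTransient = errors.New("transient failure")

// miniQueue is a hypothetical, much simplified work queue.
type miniQueue struct {
	mu       sync.Mutex
	cond     *sync.Cond
	items    []string
	queued   map[string]bool
	shutdown bool
}

func newMiniQueue() *miniQueue {
	q := &miniQueue{queued: map[string]bool{}}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// add appends a key unless it is already waiting or the queue is shut down.
func (q *miniQueue) add(key string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.shutdown || q.queued[key] {
		return
	}
	q.queued[key] = true
	q.items = append(q.items, key)
	q.cond.Signal()
}

// get blocks until a key is available; quit is true after shutdown.
func (q *miniQueue) get() (key string, quit bool) {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 && !q.shutdown {
		q.cond.Wait()
	}
	if len(q.items) == 0 {
		return "", true
	}
	key = q.items[0]
	q.items = q.items[1:]
	delete(q.queued, key)
	return key, false
}

func (q *miniQueue) shutDown() {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.shutdown = true
	q.cond.Broadcast()
}

// processNextWorkItem handles one key and reports whether to continue.
func processNextWorkItem(q *miniQueue, syncHandler func(key string) error) bool {
	key, quit := q.get()
	if quit {
		return false
	}
	if err := syncHandler(key); err != nil {
		q.add(key) // naive retry; the real controller passes the error to handleErr
	}
	return true
}

func main() {
	q := newMiniQueue()
	q.add("default/web")
	q.add("default/web") // duplicate while waiting: ignored
	q.shutDown()
	for processNextWorkItem(q, func(key string) error {
		fmt.Println("syncing", key)
		return nil
	}) {
	}
}
```

Because workers only ever receive keys, the sync function always reads the latest object state from the cache rather than a stale event payload.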

The syncHandler assignment in NewDeploymentController:

func NewDeploymentController(dInformer appsinformers.DeploymentInformer, rsInformer appsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, client clientset.Interface) (*DeploymentController, error) {
	...
	dc.syncHandler = dc.syncDeployment
	...
}

4. syncDeployment

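syncDeployment performs the deployment sync for the given key.

The main flow is as follows:

  1. Use SplitMetaNamespaceKey to get the namespace and the name of the deployment object.
  2. Call the Lister interface to get the deployment object.
  3. getReplicaSetsForDeployment gets the ReplicaSet objects managed by the deployment.
  4. getPodMapForDeployment gets the pods managed by the deployment, grouped by ReplicaSet.
  5. checkPausedConditions checks whether the deployment is paused and adds the appropriate condition.
  6. isScalingEvent checks whether the deployment update comes from a scale event; if so, the scale operation is performed.
  7. Depending on the DeploymentStrategyType, execute rolloutRecreate or rolloutRolling.

The complete code is as follows: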

// syncDeployment will sync the deployment with the given key.
// This function is not meant to be invoked concurrently with the same key.
func (dc *DeploymentController) syncDeployment(key string) error {
	startTime := time.Now()
	glog.V(4).Infof("Started syncing deployment %q (%v)", key, startTime)
	defer func() {
		glog.V(4).Infof("Finished syncing deployment %q (%v)", key, time.Since(startTime))
	}()

	namespace, name, err := cache.SplitMetaNamespaceKey(key)
	if err != nil {
		return err
	}
	deployment, err := dc.dLister.Deployments(namespace).Get(name)
	if errors.IsNotFound(err) {
		glog.V(2).Infof("Deployment %v has been deleted", key)
		return nil
	}
	if err != nil {
		return err
	}

	// Deep-copy otherwise we are mutating our cache.
	// TODO: Deep-copy only when needed.
	d := deployment.DeepCopy()

	everything := metav1.LabelSelector{}
	if reflect.DeepEqual(d.Spec.Selector, &everything) {
		dc.eventRecorder.Eventf(d, v1.EventTypeWarning, "SelectingAll", "This deployment is selecting all pods. A non-empty selector is required.")
		if d.Status.ObservedGeneration < d.Generation {
			d.Status.ObservedGeneration = d.Generation
			dc.client.AppsV1().Deployments(d.Namespace).UpdateStatus(d)
		}
		return nil
	}

	// List ReplicaSets owned by this Deployment, while reconciling ControllerRef
	// through adoption/orphaning.
	rsList, err := dc.getReplicaSetsForDeployment(d)
	if err != nil {
		return err
	}
	// List all Pods owned by this Deployment, grouped by their ReplicaSet.
	// Current uses of the podMap are:
	//
	// * check if a Pod is labeled correctly with the pod-template-hash label.
	// * check that no old Pods are running in the middle of Recreate Deployments.
	podMap, err := dc.getPodMapForDeployment(d, rsList)
	if err != nil {
		return err
	}

	if d.DeletionTimestamp != nil {
		return dc.syncStatusOnly(d, rsList)
	}

	// Update deployment conditions with an Unknown condition when pausing/resuming
	// a deployment. In this way, we can be sure that we won't timeout when a user
	// resumes a Deployment with a set progressDeadlineSeconds.
	if err = dc.checkPausedConditions(d); err != nil {
		return err
	}

	if d.Spec.Paused {
		return dc.sync(d, rsList)
	}

	// rollback is not re-entrant in case the underlying replica sets are updated with a new
	// revision so we should ensure that we won't proceed to update replica sets until we
	// make sure that the deployment has cleaned up its rollback spec in subsequent enqueues.
	if getRollbackTo(d) != nil {
		return dc.rollback(d, rsList)
	}

	scalingEvent, err := dc.isScalingEvent(d, rsList)
	if err != nil {
		return err
	}
	if scalingEvent {
		return dc.sync(d, rsList)
	}

	switch d.Spec.Strategy.Type {
	case apps.RecreateDeploymentStrategyType:
		return dc.rolloutRecreate(d, rsList, podMap)
	case apps.RollingUpdateDeploymentStrategyType:
		return dc.rolloutRolling(d, rsList)
	}
	return fmt.Errorf("unexpected deployment strategy type: %s", d.Spec.Strategy.Type)
}

4.1. Get deployment

// get namespace and deployment name
namespace, name, err := cache.SplitMetaNamespaceKey(key)
// get deployment by name
deployment, err := dc.dLister.Deployments(namespace).Get(name)
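
The key handling can be sketched as follows, assuming the usual "namespace/name" key format. splitKey and the in-memory store are hypothetical stand-ins for SplitMetaNamespaceKey and the lister's cache; a lookup that finds nothing models a deployment that has already been deleted.

```go
package main

import (
	"fmt"
	"strings"
)

// splitKey splits "namespace/name" into its parts; a key without a slash
// is treated as a name with an empty namespace.
func splitKey(key string) (namespace, name string, err error) {
	parts := strings.Split(key, "/")
	switch len(parts) {
	case 1:
		return "", parts[0], nil
	case 2:
		return parts[0], parts[1], nil
	}
	return "", "", fmt.Errorf("unexpected key format: %q", key)
}

// store is a hypothetical in-memory stand-in for the lister's cache.
type store map[string]string

// lookup resolves a key; found=false models an object that no longer exists.
func (s store) lookup(key string) (namespace, name string, found bool, err error) {
	namespace, name, err = splitKey(key)
	if err != nil {
		return "", "", false, err
	}
	_, found = s[key]
	return namespace, name, found, nil
}

func main() {
	cache := store{"default/web": "deployment web"}
	for _, key := range []string{"default/web", "default/gone", "a/b/c"} {
		ns, name, found, err := cache.lookup(key)
		fmt.Println(ns, name, found, err)
	}
}
```

Treating "not found" as success rather than as an error keeps a deleted object from being retried forever.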

4.2. getReplicaSetsForDeployment

// List ReplicaSets owned by this Deployment, while reconciling ControllerRef
// through adoption/orphaning.
rsList, err := dc.getReplicaSetsForDeployment(d)

The code of getReplicaSetsForDeployment:

// getReplicaSetsForDeployment uses ControllerRefManager to reconcile
// ControllerRef by adopting and orphaning.
// It returns the list of ReplicaSets that this Deployment should manage.
func (dc *DeploymentController) getReplicaSetsForDeployment(d *apps.Deployment) ([]*apps.ReplicaSet, error) {
	// List all ReplicaSets to find those we own but that no longer match our
	// selector. They will be orphaned by ClaimReplicaSets().
	rsList, err := dc.rsLister.ReplicaSets(d.Namespace).List(labels.Everything())
	if err != nil {
		return nil, err
	}
	deploymentSelector, err := metav1.LabelSelectorAsSelector(d.Spec.Selector)
	if err != nil {
		return nil, fmt.Errorf("deployment %s/%s has invalid label selector: %v", d.Namespace, d.Name, err)
	}
	// If any adoptions are attempted, we should first recheck for deletion with
	// an uncached quorum read sometime after listing ReplicaSets (see #42639).
	canAdoptFunc := controller.RecheckDeletionTimestamp(func() (metav1.Object, error) {
		fresh, err := dc.client.AppsV1().Deployments(d.Namespace).Get(d.Name, metav1.GetOptions{})
		if err != nil {
			return nil, err
		}
		if fresh.UID != d.UID {
			return nil, fmt.Errorf("original Deployment %v/%v is gone: got uid %v, wanted %v", d.Namespace, d.Name, fresh.UID, d.UID)
		}
		return fresh, nil
	})
	cm := controller.NewReplicaSetControllerRefManager(dc.rsControl, d, deploymentSelector, controllerKind, canAdoptFunc)
	return cm.ClaimReplicaSets(rsList)
}

4.3. getPodMapForDeployment

// List all Pods owned by this Deployment, grouped by their ReplicaSet.
// Current uses of the podMap are:
//
// * check if a Pod is labeled correctly with the pod-template-hash label.
// * check that no old Pods are running in the middle of Recreate Deployments.
podMap, err := dc.getPodMapForDeployment(d, rsList)

The code of getPodMapForDeployment:

// getPodMapForDeployment returns the Pods managed by a Deployment.
//
// It returns a map from ReplicaSet UID to a list of Pods controlled by that RS,
// according to the Pod's ControllerRef.
func (dc *DeploymentController) getPodMapForDeployment(d *apps.Deployment, rsList []*apps.ReplicaSet) (map[types.UID]*v1.PodList, error) {
	// Get all Pods that potentially belong to this Deployment.
	selector, err := metav1.LabelSelectorAsSelector(d.Spec.Selector)
	if err != nil {
		return nil, err
	}
	pods, err := dc.podLister.Pods(d.Namespace).List(selector)
	if err != nil {
		return nil, err
	}
	// Group Pods by their controller (if it's in rsList).
	podMap := make(map[types.UID]*v1.PodList, len(rsList))
	for _, rs := range rsList {
		podMap[rs.UID] = &v1.PodList{}
	}
	for _, pod := range pods {
		// Do not ignore inactive Pods because Recreate Deployments need to verify that no
		// Pods from older versions are running before spinning up new Pods.
		controllerRef := metav1.GetControllerOf(pod)
		if controllerRef == nil {
			continue
		}
		// Only append if we care about this UID.
		if podList, ok := podMap[controllerRef.UID]; ok {
			podList.Items = append(podList.Items, *pod)
		}
	}
	return podMap, nil
}
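
The grouping step is easy to reproduce in isolation. In this sketch the pod struct and plain string UIDs are simplifications introduced for the example; the logic follows the function above: every ReplicaSet gets an entry, pods without a controller are skipped, and pods owned by an unknown UID are ignored.

```go
package main

import "fmt"

// pod is a hypothetical, reduced pod: just a name and its controller's UID.
type pod struct {
	name     string
	ownerUID string // empty when the pod has no controller
}

// groupPodsByOwner returns a map from ReplicaSet UID to the pods it controls.
func groupPodsByOwner(rsUIDs []string, pods []pod) map[string][]pod {
	podMap := make(map[string][]pod, len(rsUIDs))
	for _, uid := range rsUIDs {
		podMap[uid] = nil // every known ReplicaSet gets an entry, even if empty
	}
	for _, p := range pods {
		if p.ownerUID == "" {
			continue
		}
		// Only append if we care about this UID.
		if _, ok := podMap[p.ownerUID]; ok {
			podMap[p.ownerUID] = append(podMap[p.ownerUID], p)
		}
	}
	return podMap
}

func main() {
	pods := []pod{
		{"web-old-1", "rs-old"},
		{"web-new-1", "rs-new"},
		{"web-new-2", "rs-new"},
		{"orphan", ""},
		{"other", "rs-other"},
	}
	podMap := groupPodsByOwner([]string{"rs-old", "rs-new"}, pods)
	fmt.Println(len(podMap["rs-old"]), len(podMap["rs-new"]))
}
```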

4.4. checkPausedConditions

// Update deployment conditions with an Unknown condition when pausing/resuming
// a deployment. In this way, we can be sure that we won't timeout when a user
// resumes a Deployment with a set progressDeadlineSeconds.
if err = dc.checkPausedConditions(d); err != nil {
	return err
}

if d.Spec.Paused {
	return dc.sync(d, rsList)
}

The code of checkPausedConditions:

// checkPausedConditions checks if the given deployment is paused or not and adds an appropriate condition.
// These conditions are needed so that we won't accidentally report lack of progress for resumed deployments
// that were paused for longer than progressDeadlineSeconds.
func (dc *DeploymentController) checkPausedConditions(d *apps.Deployment) error {
	if !deploymentutil.HasProgressDeadline(d) {
		return nil
	}
	cond := deploymentutil.GetDeploymentCondition(d.Status, apps.DeploymentProgressing)
	if cond != nil && cond.Reason == deploymentutil.TimedOutReason {
		// If we have reported lack of progress, do not overwrite it with a paused condition.
		return nil
	}
	pausedCondExists := cond != nil && cond.Reason == deploymentutil.PausedDeployReason

	needsUpdate := false
	if d.Spec.Paused && !pausedCondExists {
		condition := deploymentutil.NewDeploymentCondition(apps.DeploymentProgressing, v1.ConditionUnknown, deploymentutil.PausedDeployReason, "Deployment is paused")
		deploymentutil.SetDeploymentCondition(&d.Status, *condition)
		needsUpdate = true
	} else if !d.Spec.Paused && pausedCondExists {
		condition := deploymentutil.NewDeploymentCondition(apps.DeploymentProgressing, v1.ConditionUnknown, deploymentutil.ResumedDeployReason, "Deployment is resumed")
		deploymentutil.SetDeploymentCondition(&d.Status, *condition)
		needsUpdate = true
	}

	if !needsUpdate {
		return nil
	}

	var err error
	d, err = dc.client.AppsV1().Deployments(d.Namespace).UpdateStatus(d)
	return err
}
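
The decision made by checkPausedConditions reduces to a small truth table, which the following sketch spells out. The reason strings and the function nextProgressReason are hypothetical; only the branching mirrors the code above.

```go
package main

import "fmt"

// Hypothetical reason values; the real constants live in deploymentutil.
const (
	reasonPaused   = "Paused"
	reasonResumed  = "Resumed"
	reasonTimedOut = "TimedOut"
)

// nextProgressReason returns the reason the Progressing condition should
// carry and whether a status update is needed.
func nextProgressReason(hasDeadline, paused bool, current string) (reason string, needsUpdate bool) {
	// No progress deadline, or a timeout already reported: leave it alone.
	if !hasDeadline || current == reasonTimedOut {
		return current, false
	}
	pausedCondExists := current == reasonPaused
	switch {
	case paused && !pausedCondExists:
		return reasonPaused, true
	case !paused && pausedCondExists:
		return reasonResumed, true
	}
	return current, false
}

func main() {
	fmt.Println(nextProgressReason(true, true, ""))              // newly paused
	fmt.Println(nextProgressReason(true, false, reasonPaused))   // resumed
	fmt.Println(nextProgressReason(true, true, reasonTimedOut))  // timeout is kept
	fmt.Println(nextProgressReason(false, true, ""))             // no deadline set
}
```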

4.5. isScalingEvent

scalingEvent, err := dc.isScalingEvent(d, rsList)
if err != nil {
	return err
}
if scalingEvent {
	return dc.sync(d, rsList)
}

The code of isScalingEvent:

// isScalingEvent checks whether the provided deployment has been updated with a scaling event
// by looking at the desired-replicas annotation in the active replica sets of the deployment.
//
// rsList should come from getReplicaSetsForDeployment(d).
// podMap should come from getPodMapForDeployment(d, rsList).
func (dc *DeploymentController) isScalingEvent(d *apps.Deployment, rsList []*apps.ReplicaSet) (bool, error) {
	newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
	if err != nil {
		return false, err
	}
	allRSs := append(oldRSs, newRS)
	for _, rs := range controller.FilterActiveReplicaSets(allRSs) {
		desired, ok := deploymentutil.GetDesiredReplicasAnnotation(rs)
		if !ok {
			continue
		}
		if desired != *(d.Spec.Replicas) {
			return true, nil
		}
	}
	return false, nil
}
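
The check can be reproduced with plain data: each active ReplicaSet carries the replica count that was desired when it was last synced, and a mismatch with the current spec signals a scale event. The annotation key, the replicaSet struct and the "replicas > 0 means active" rule below are assumptions made for the sketch.

```go
package main

import (
	"fmt"
	"strconv"
)

// desiredReplicasKey is a hypothetical annotation key for this sketch.
const desiredReplicasKey = "example.io/desired-replicas"

// replicaSet is a hypothetical, reduced ReplicaSet.
type replicaSet struct {
	replicas    int32
	annotations map[string]string
}

// isScalingEvent reports whether any active ReplicaSet recorded a desired
// replica count that differs from the deployment's current spec.
func isScalingEvent(specReplicas int32, rsList []replicaSet) bool {
	for _, rs := range rsList {
		if rs.replicas == 0 {
			continue // inactive ReplicaSets are ignored
		}
		raw, ok := rs.annotations[desiredReplicasKey]
		if !ok {
			continue
		}
		desired, err := strconv.ParseInt(raw, 10, 32)
		if err != nil {
			continue
		}
		if int32(desired) != specReplicas {
			return true
		}
	}
	return false
}

func main() {
	rsList := []replicaSet{{replicas: 3, annotations: map[string]string{desiredReplicasKey: "3"}}}
	fmt.Println(isScalingEvent(3, rsList), isScalingEvent(5, rsList))
}
```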

4.6. rolloutRecreate

switch d.Spec.Strategy.Type {
case apps.RecreateDeploymentStrategyType:
	return dc.rolloutRecreate(d, rsList, podMap)

The code of rolloutRecreate:

// rolloutRecreate implements the logic for recreating a replica set.
func (dc *DeploymentController) rolloutRecreate(d *apps.Deployment, rsList []*apps.ReplicaSet, podMap map[types.UID]*v1.PodList) error {
	// Don't create a new RS if not already existed, so that we avoid scaling up before scaling down.
	newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, false)
	if err != nil {
		return err
	}
	allRSs := append(oldRSs, newRS)
	activeOldRSs := controller.FilterActiveReplicaSets(oldRSs)

	// scale down old replica sets.
	scaledDown, err := dc.scaleDownOldReplicaSetsForRecreate(activeOldRSs, d)
	if err != nil {
		return err
	}
	if scaledDown {
		// Update DeploymentStatus.
		return dc.syncRolloutStatus(allRSs, newRS, d)
	}

	// Do not process a deployment when it has old pods running.
	if oldPodsRunning(newRS, oldRSs, podMap) {
		return dc.syncRolloutStatus(allRSs, newRS, d)
	}

	// If we need to create a new RS, create it now.
	if newRS == nil {
		newRS, oldRSs, err = dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)
		if err != nil {
			return err
		}
		allRSs = append(oldRSs, newRS)
	}

	// scale up new replica set.
	if _, err := dc.scaleUpNewReplicaSetForRecreate(newRS, d); err != nil {
		return err
	}

	if util.DeploymentComplete(d, &d.Status) {
		if err := dc.cleanupDeployment(oldRSs, d); err != nil {
			return err
		}
	}

	// Sync deployment status.
	return dc.syncRolloutStatus(allRSs, newRS, d)
}
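
The ordering enforced by rolloutRecreate (scale the old ReplicaSets down, wait until no old pod is running, only then scale the new one up) can be shown as a tiny state machine. recreateState and recreateStep are hypothetical, and each call performs at most one action, loosely corresponding to one sync of the controller.

```go
package main

import "fmt"

// recreateState is a hypothetical snapshot of a Recreate rollout.
type recreateState struct {
	oldReplicas    int // replicas requested on the old ReplicaSets
	oldPodsRunning int // old pods that have not terminated yet
	newReplicas    int // replicas requested on the new ReplicaSet
	desired        int // replicas requested by the deployment
}

// recreateStep performs at most one action and reports what it did.
func recreateStep(s *recreateState) string {
	if s.oldReplicas > 0 {
		s.oldReplicas = 0
		return "scaled down old"
	}
	if s.oldPodsRunning > 0 {
		return "waiting for old pods"
	}
	if s.newReplicas != s.desired {
		s.newReplicas = s.desired
		return "scaled up new"
	}
	return "complete"
}

func main() {
	s := &recreateState{oldReplicas: 3, oldPodsRunning: 3, desired: 3}
	fmt.Println(recreateStep(s))
	fmt.Println(recreateStep(s))
	s.oldPodsRunning = 0 // the old pods have terminated
	fmt.Println(recreateStep(s))
	fmt.Println(recreateStep(s))
}
```

The cost of this ordering is a window with no running pods, which is the trade-off the Recreate strategy accepts.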

4.7. rolloutRolling

switch d.Spec.Strategy.Type {
case apps.RecreateDeploymentStrategyType:
	return dc.rolloutRecreate(d, rsList, podMap)
case apps.RollingUpdateDeploymentStrategyType:
	return dc.rolloutRolling(d, rsList)
}

The code of rolloutRolling:

// rolloutRolling implements the logic for rolling a new replica set.
func (dc *DeploymentController) rolloutRolling(d *apps.Deployment, rsList []*apps.ReplicaSet) error {
	newRS, oldRSs, err := dc.getAllReplicaSetsAndSyncRevision(d, rsList, true)
	if err != nil {
		return err
	}
	allRSs := append(oldRSs, newRS)

	// Scale up, if we can.
	scaledUp, err := dc.reconcileNewReplicaSet(allRSs, newRS, d)
	if err != nil {
		return err
	}
	if scaledUp {
		// Update DeploymentStatus
		return dc.syncRolloutStatus(allRSs, newRS, d)
	}

	// Scale down, if we can.
	scaledDown, err := dc.reconcileOldReplicaSets(allRSs, controller.FilterActiveReplicaSets(oldRSs), newRS, d)
	if err != nil {
		return err
	}
	if scaledDown {
		// Update DeploymentStatus
		return dc.syncRolloutStatus(allRSs, newRS, d)
	}

	if deploymentutil.DeploymentComplete(d, &d.Status) {
		if err := dc.cleanupDeployment(oldRSs, d); err != nil {
			return err
		}
	}

	// Sync deployment status
	return dc.syncRolloutStatus(allRSs, newRS, d)
}
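
The "scale up if we can, otherwise scale down if we can" rhythm can be simulated with a toy model. Everything here is an assumption made for illustration: the surge and unavailable budgets stand in for the rolling-update limits, all replicas are treated as instantly available, and rollingStep is not how reconcileNewReplicaSet or reconcileOldReplicaSets are implemented.

```go
package main

import "fmt"

// rollout is a hypothetical snapshot of a rolling update.
type rollout struct {
	desired     int // replicas requested by the deployment
	surge       int // extra replicas allowed above desired
	unavailable int // replicas allowed to be missing below desired
	oldReplicas int
	newReplicas int
}

func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// rollingStep performs at most one scaling action per call.
func rollingStep(r *rollout) string {
	// Scale up, if we can: the total may not exceed desired+surge.
	if r.newReplicas < r.desired {
		room := r.desired + r.surge - (r.oldReplicas + r.newReplicas)
		if room > 0 {
			r.newReplicas += minInt(room, r.desired-r.newReplicas)
			return "scaled up"
		}
	}
	// Scale down, if we can: the total may not drop below desired-unavailable.
	if r.oldReplicas > 0 {
		spare := r.oldReplicas + r.newReplicas - (r.desired - r.unavailable)
		if spare > 0 {
			r.oldReplicas -= minInt(spare, r.oldReplicas)
			return "scaled down"
		}
	}
	if r.oldReplicas == 0 && r.newReplicas == r.desired {
		return "complete"
	}
	return "waiting"
}

func main() {
	r := &rollout{desired: 3, surge: 1, oldReplicas: 3}
	for i := 0; i < 10; i++ {
		status := rollingStep(r)
		fmt.Println(status, "old:", r.oldReplicas, "new:", r.newReplicas)
		if status == "complete" {
			break
		}
	}
}
```

With both budgets set to zero this model can never make progress, so the loop in main is capped.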

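5. Summary

startDeploymentController mainly consists of two parts: NewDeploymentController and DeploymentController.Run.

NewDeploymentController mainly builds the DeploymentController struct.

It handles the following logic:

  1. Build and start the event broadcaster, eventBroadcaster.
  2. Initialize rsControl, clientset, and workqueue.
  3. Add ResourceEventHandlerFuncs to dInformer, rsInformer, and podInformer; these are mainly the AddFunc, UpdateFunc, and DeleteFunc methods.
  4. Set up the Lister and HasSynced functions of the deployment, rs, and pod Informers.
  5. Assign syncHandler, which is implemented by syncDeployment.

DeploymentController.Run mainly consists of two parts: WaitForCacheSync and syncDeployment.

syncDeployment performs the deployment sync for the given key.

The main flow is as follows:

  1. Use SplitMetaNamespaceKey to get the namespace and the name of the deployment object.
  2. Call the Lister interface to get the deployment object.
  3. getReplicaSetsForDeployment gets the ReplicaSet objects managed by the deployment.
  4. getPodMapForDeployment gets the pods managed by the deployment, grouped by ReplicaSet.
  5. checkPausedConditions checks whether the deployment is paused and adds the appropriate condition.
  6. isScalingEvent checks whether the deployment update comes from a scale event; if so, the scale operation is performed.
  7. Depending on the DeploymentStrategyType, execute rolloutRecreate or rolloutRolling.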


11.3.2 -

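kube-controller-manager Source Code Analysis (Part 1): NewControllerManagerCommand

The following code analysis is based on Kubernetes v1.12.0.

This article mainly analyzes the code under https://github.com/kubernetes/kubernetes/tree/v1.12.0/cmd/kube-controller-manager.

It focuses on kubernetes/cmd/kube-controller-manager, which mainly covers flag parsing and initialization for the various types of controllers, such as the deployment controller and the statefulset controller. It does not contain the detailed run logic of any specific controller; that logic lives in the kubernetes/pkg/controller module and will be analyzed in later articles.

The directory structure of the kube-controller-manager cmd code is as follows: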

kube-controller-manager
├── app
│   ├── apps.go   # contains: startDeploymentController, startReplicaSetController, startStatefulSetController, startDaemonSetController
│   ├── autoscaling.go # startHPAController
│   ├── batch.go  # startJobController, startCronJobController
│   ├── bootstrap.go
│   ├── certificates.go
│   ├── cloudproviders.go
│   ├── config
│   │   └── config.go   # config: the context in which the controller manager runs
│   ├── controllermanager.go   # contains: NewControllerManagerCommand, Run, NewControllerInitializers, StartControllers, etc.
│   ├── core.go   # startServiceController, startNodeIpamController, startPersistentVolumeBinderController, startNamespaceController, etc.
│   ├── options    # option parameters for the different controllers
│   │   ├── attachdetachcontroller.go
│   │   ├── csrsigningcontroller.go
│   │   ├── daemonsetcontroller.go   # DaemonSetControllerOptions
│   │   ├── deploymentcontroller.go  # DeploymentControllerOptions
│   │   ├── deprecatedcontroller.go
│   │   ├── endpointcontroller.go
│   │   ├── garbagecollectorcontroller.go
│   │   ├── hpacontroller.go
│   │   ├── jobcontroller.go
│   │   ├── namespacecontroller.go   # NamespaceControllerOptions
│   │   ├── nodeipamcontroller.go
│   │   ├── nodelifecyclecontroller.go
│   │   ├── options.go  # KubeControllerManagerOptions, NewKubeControllerManagerOptions
│   │   ├── persistentvolumebindercontroller.go
│   │   ├── podgccontroller.go
│   │   ├── replicasetcontroller.go   # ReplicaSetControllerOptions
│   │   ├── replicationcontroller.go
│   │   ├── resourcequotacontroller.go
│   │   ├── serviceaccountcontroller.go
│   │   └── ttlafterfinishedcontroller.go
└── controller-manager.go  # main entry point

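1. The main function

The entry point of kube-controller-manager is the main function. It follows the same code style as the other components and uses the Cobra command-line framework.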

func main() {
	rand.Seed(time.Now().UTC().UnixNano())

	command := app.NewControllerManagerCommand()

	// TODO: once we switch everything over to Cobra commands, we can go back to calling
	// utilflag.InitFlags() (by removing its pflag.Parse() call). For now, we have to set the
	// normalize func and add the go flag set by hand.
	pflag.CommandLine.SetNormalizeFunc(utilflag.WordSepNormalizeFunc)
	pflag.CommandLine.AddGoFlagSet(goflag.CommandLine)
	// utilflag.InitFlags()
	logs.InitLogs()
	defer logs.FlushLogs()

	if err := command.Execute(); err != nil {
		fmt.Fprintf(os.Stderr, "%v\n", err)
		os.Exit(1)
	}
}

Core code:

// Initialize the command struct
command := app.NewControllerManagerCommand()
// Call Execute
err := command.Execute()

2. NewControllerManagerCommand

This code is located in kubernetes/cmd/kube-controller-manager/app/controllermanager.go

// NewControllerManagerCommand creates a *cobra.Command object with default parameters
func NewControllerManagerCommand() *cobra.Command {
	...
	cmd := &cobra.Command{
		Use: "kube-controller-manager",
		Long: `The Kubernetes controller manager is a daemon that embeds
the core control loops shipped with Kubernetes. In applications of robotics and
automation, a control loop is a non-terminating loop that regulates the state of
the system. In Kubernetes, a controller is a control loop that watches the shared
state of the cluster through the apiserver and makes changes attempting to move the
current state towards the desired state. Examples of controllers that ship with
Kubernetes today are the replication controller, endpoints controller, namespace
controller, and serviceaccounts controller.`,
		Run: func(cmd *cobra.Command, args []string) {
			verflag.PrintAndExitIfRequested()
			utilflag.PrintFlags(cmd.Flags())

			c, err := s.Config(KnownControllers(), ControllersDisabledByDefault.List())
			if err != nil {
				fmt.Fprintf(os.Stderr, "%v\n", err)
				os.Exit(1)
			}

			if err := Run(c.Complete(), wait.NeverStop); err != nil {
				fmt.Fprintf(os.Stderr, "%v\n", err)
				os.Exit(1)
			}
		},
	}
	...
}

Build a *cobra.Command object, then execute the Run function.

2.1. NewKubeControllerManagerOptions

s, err := options.NewKubeControllerManagerOptions()
if err != nil {
	glog.Fatalf("unable to initialize command options: %v", err)
}

Initialize the controllerManager options, which mainly include the options of the various controllers, for example DeploymentControllerOptions:

// DeploymentControllerOptions holds the DeploymentController options.
type DeploymentControllerOptions struct {
	ConcurrentDeploymentSyncs      int32
	DeploymentControllerSyncPeriod metav1.Duration
}

The code is as follows:

// NewKubeControllerManagerOptions creates a new KubeControllerManagerOptions with a default config.
func NewKubeControllerManagerOptions() (*KubeControllerManagerOptions, error) {
	componentConfig, err := NewDefaultComponentConfig(ports.InsecureKubeControllerManagerPort)
	if err != nil {
		return nil, err
	}

	s := KubeControllerManagerOptions{
		Generic:         cmoptions.NewGenericControllerManagerConfigurationOptions(componentConfig.Generic),
		KubeCloudShared: cmoptions.NewKubeCloudSharedOptions(componentConfig.KubeCloudShared),
		AttachDetachController: &AttachDetachControllerOptions{
			ReconcilerSyncLoopPeriod: componentConfig.AttachDetachController.ReconcilerSyncLoopPeriod,
		},
		CSRSigningController: &CSRSigningControllerOptions{
			ClusterSigningCertFile: componentConfig.CSRSigningController.ClusterSigningCertFile,
			ClusterSigningKeyFile:  componentConfig.CSRSigningController.ClusterSigningKeyFile,
			ClusterSigningDuration: componentConfig.CSRSigningController.ClusterSigningDuration,
		},
		DaemonSetController: &DaemonSetControllerOptions{
			ConcurrentDaemonSetSyncs: componentConfig.DaemonSetController.ConcurrentDaemonSetSyncs,
		},
		DeploymentController: &DeploymentControllerOptions{
			ConcurrentDeploymentSyncs:      componentConfig.DeploymentController.ConcurrentDeploymentSyncs,
			DeploymentControllerSyncPeriod: componentConfig.DeploymentController.DeploymentControllerSyncPeriod,
		},
		DeprecatedFlags: &DeprecatedControllerOptions{
			RegisterRetryCount: componentConfig.DeprecatedController.RegisterRetryCount,
		},
		EndpointController: &EndpointControllerOptions{
			ConcurrentEndpointSyncs: componentConfig.EndpointController.ConcurrentEndpointSyncs,
		},
		GarbageCollectorController: &GarbageCollectorControllerOptions{
			ConcurrentGCSyncs:      componentConfig.GarbageCollectorController.ConcurrentGCSyncs,
			EnableGarbageCollector: componentConfig.GarbageCollectorController.EnableGarbageCollector,
		},
		HPAController: &HPAControllerOptions{
			HorizontalPodAutoscalerSyncPeriod:                   componentConfig.HPAController.HorizontalPodAutoscalerSyncPeriod,
			HorizontalPodAutoscalerUpscaleForbiddenWindow:       componentConfig.HPAController.HorizontalPodAutoscalerUpscaleForbiddenWindow,
			HorizontalPodAutoscalerDownscaleForbiddenWindow:     componentConfig.HPAController.HorizontalPodAutoscalerDownscaleForbiddenWindow,
			HorizontalPodAutoscalerDownscaleStabilizationWindow: componentConfig.HPAController.HorizontalPodAutoscalerDownscaleStabilizationWindow,
			HorizontalPodAutoscalerCPUInitializationPeriod:      componentConfig.HPAController.HorizontalPodAutoscalerCPUInitializationPeriod,
			HorizontalPodAutoscalerInitialReadinessDelay:        componentConfig.HPAController.HorizontalPodAutoscalerInitialReadinessDelay,
			HorizontalPodAutoscalerTolerance:                    componentConfig.HPAController.HorizontalPodAutoscalerTolerance,
			HorizontalPodAutoscalerUseRESTClients:               componentConfig.HPAController.HorizontalPodAutoscalerUseRESTClients,
		},
		JobController: &JobControllerOptions{
			ConcurrentJobSyncs: componentConfig.JobController.ConcurrentJobSyncs,
		},
		NamespaceController: &NamespaceControllerOptions{
			NamespaceSyncPeriod:      componentConfig.NamespaceController.NamespaceSyncPeriod,
			ConcurrentNamespaceSyncs: componentConfig.NamespaceController.ConcurrentNamespaceSyncs,
		},
		NodeIPAMController: &NodeIPAMControllerOptions{
			NodeCIDRMaskSize: componentConfig.NodeIPAMController.NodeCIDRMaskSize,
		},
		NodeLifecycleController: &NodeLifecycleControllerOptions{
			EnableTaintManager:     componentConfig.NodeLifecycleController.EnableTaintManager,
			NodeMonitorGracePeriod: componentConfig.NodeLifecycleController.NodeMonitorGracePeriod,
			NodeStartupGracePeriod: componentConfig.NodeLifecycleController.NodeStartupGracePeriod,
			PodEvictionTimeout:     componentConfig.NodeLifecycleController.PodEvictionTimeout,
		},
		PersistentVolumeBinderController: &PersistentVolumeBinderControllerOptions{
			PVClaimBinderSyncPeriod: componentConfig.PersistentVolumeBinderController.PVClaimBinderSyncPeriod,
			VolumeConfiguration:     componentConfig.PersistentVolumeBinderController.VolumeConfiguration,
		},
		PodGCController: &PodGCControllerOptions{
			TerminatedPodGCThreshold: componentConfig.PodGCController.TerminatedPodGCThreshold,
		},
		ReplicaSetController: &ReplicaSetControllerOptions{
			ConcurrentRSSyncs: componentConfig.ReplicaSetController.ConcurrentRSSyncs,
		},
		ReplicationController: &ReplicationControllerOptions{
			ConcurrentRCSyncs: componentConfig.ReplicationController.ConcurrentRCSyncs,
		},
		ResourceQuotaController: &ResourceQuotaControllerOptions{
			ResourceQuotaSyncPeriod:      componentConfig.ResourceQuotaController.ResourceQuotaSyncPeriod,
			ConcurrentResourceQuotaSyncs: componentConfig.ResourceQuotaController.ConcurrentResourceQuotaSyncs,
		},
		SAController: &SAControllerOptions{
			ConcurrentSATokenSyncs: componentConfig.SAController.ConcurrentSATokenSyncs,
		},
		ServiceController: &cmoptions.ServiceControllerOptions{
			ConcurrentServiceSyncs: componentConfig.ServiceController.ConcurrentServiceSyncs,
		},
		TTLAfterFinishedController: &TTLAfterFinishedControllerOptions{
			ConcurrentTTLSyncs: componentConfig.TTLAfterFinishedController.ConcurrentTTLSyncs,
		},
		SecureServing: apiserveroptions.NewSecureServingOptions().WithLoopback(),
		InsecureServing: (&apiserveroptions.DeprecatedInsecureServingOptions{
			BindAddress: net.ParseIP(componentConfig.Generic.Address),
			BindPort:    int(componentConfig.Generic.Port),
			BindNetwork: "tcp",
		}).WithLoopback(),
		Authentication: apiserveroptions.NewDelegatingAuthenticationOptions(),
		Authorization:  apiserveroptions.NewDelegatingAuthorizationOptions(),
	}

	s.Authentication.RemoteKubeConfigFileOptional = true
	s.Authorization.RemoteKubeConfigFileOptional = true
	s.Authorization.AlwaysAllowPaths = []string{"/healthz"}

	s.SecureServing.ServerCert.CertDirectory = "/var/run/kubernetes"
	s.SecureServing.ServerCert.PairName = "kube-controller-manager"
	s.SecureServing.BindPort = ports.KubeControllerManagerPort

	gcIgnoredResources := make([]kubectrlmgrconfig.GroupResource, 0, len(garbagecollector.DefaultIgnoredResources()))
	for r := range garbagecollector.DefaultIgnoredResources() {
		gcIgnoredResources = append(gcIgnoredResources, kubectrlmgrconfig.GroupResource{Group: r.Group, Resource: r.Resource})
	}

	s.GarbageCollectorController.GCIgnoredResources = gcIgnoredResources

	return &s, nil
}

2.2. AddFlagSet

Register the command-line flags and set the usage and help functions.

fs := cmd.Flags()
namedFlagSets := s.Flags(KnownControllers(), ControllersDisabledByDefault.List())
for _, f := range namedFlagSets.FlagSets {
	fs.AddFlagSet(f)
}
usageFmt := "Usage:\n  %s\n"
cols, _, _ := apiserverflag.TerminalSize(cmd.OutOrStdout())
cmd.SetUsageFunc(func(cmd *cobra.Command) error {
	fmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine())
	apiserverflag.PrintSections(cmd.OutOrStderr(), namedFlagSets, cols)
	return nil
})
cmd.SetHelpFunc(func(cmd *cobra.Command, args []string) {
	fmt.Fprintf(cmd.OutOrStdout(), "%s\n\n"+usageFmt, cmd.Long, cmd.UseLine())
	apiserverflag.PrintSections(cmd.OutOrStdout(), namedFlagSets, cols)
})
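
The excerpt above relies on cobra and on the apiserver flag helpers. As a rough, standard-library-only sketch of the same idea (several named flag sets merged into the one set that is actually parsed), the following uses Go's `flag` package. The helper `mergeFlagSets`, the set names, and the flag defaults are illustrative assumptions, not kube-controller-manager code.

```go
package main

import (
	"flag"
	"fmt"
)

// mergeFlagSets registers every flag of each named set on dst. The flag.Value
// is shared, so parsing dst updates the variables bound to the original sets.
func mergeFlagSets(dst *flag.FlagSet, sets ...*flag.FlagSet) {
	for _, set := range sets {
		set.VisitAll(func(f *flag.Flag) {
			dst.Var(f.Value, f.Name, f.Usage)
		})
	}
}

// parseExample builds two hypothetical named sets, merges them, and parses args.
func parseExample(args []string) (port int, syncs int, err error) {
	generic := flag.NewFlagSet("generic", flag.ContinueOnError)
	portFlag := generic.Int("port", 10252, "port to serve on (illustrative default)")

	deployment := flag.NewFlagSet("deployment controller", flag.ContinueOnError)
	syncsFlag := deployment.Int("concurrent-deployment-syncs", 5, "number of workers (illustrative default)")

	all := flag.NewFlagSet("controller-manager-sketch", flag.ContinueOnError)
	mergeFlagSets(all, generic, deployment)
	if err := all.Parse(args); err != nil {
		return 0, 0, err
	}
	return *portFlag, *syncsFlag, nil
}

func main() {
	port, syncs, err := parseExample([]string{"--port=8080"})
	fmt.Println(port, syncs, err)
}
```

Keeping the flags in named groups is what allows the help output to be printed section by section, while the command itself still parses a single merged set.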

3. Run

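The code for this part is in cmd/kube-controller-manager/app/controllermanager.go.

Run starts the controller manager from the completed configuration (derived from KubeControllerManagerOptions) and is not expected to return.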

// Run runs the KubeControllerManagerOptions.  This should never exit.
func Run(c *config.CompletedConfig, stopCh <-chan struct{}) error {
	...
	run := func(ctx context.Context) {
		...
		controllerContext, err := CreateControllerContext(c, rootClientBuilder, clientBuilder, ctx.Done())
		if err != nil {
			glog.Fatalf("error building controller context: %v", err)
		}
		saTokenControllerInitFunc := serviceAccountTokenControllerStarter{rootClientBuilder: rootClientBuilder}.startServiceAccountTokenController

		if err := StartControllers(controllerContext, saTokenControllerInitFunc, NewControllerInitializers(controllerContext.LoopMode), unsecuredMux); err != nil {
			glog.Fatalf("error starting controllers: %v", err)
		}

		controllerContext.InformerFactory.Start(controllerContext.Stop)
		close(controllerContext.InformersStarted)

		select {}
	}
	...
}

The core code in Run is:

// Create the context shared by all controllers.
controllerContext, err := CreateControllerContext(c, rootClientBuilder, clientBuilder, ctx.Done())
// Start the controllers.
err = StartControllers(controllerContext, saTokenControllerInitFunc, NewControllerInitializers(controllerContext.LoopMode), unsecuredMux)

NewControllerInitializers, passed to StartControllers as an argument, returns the map from controller name to init function.

3.1. CreateControllerContext

CreateControllerContext builds the context that holds the resources the controllers need. Each controller receives this context as its argument when it starts; see initFn(ctx).

// CreateControllerContext creates a context struct containing references to resources needed by the
// controllers such as the cloud provider and clientBuilder. rootClientBuilder is only used for
// the shared-informers client and token controller.
func CreateControllerContext(s *config.CompletedConfig, rootClientBuilder, clientBuilder controller.ControllerClientBuilder, stop <-chan struct{}) (ControllerContext, error) {
	versionedClient := rootClientBuilder.ClientOrDie("shared-informers")
	sharedInformers := informers.NewSharedInformerFactory(versionedClient, ResyncPeriod(s)())

	// If apiserver is not running we should wait for some time and fail only then. This is particularly
	// important when we start apiserver and controller manager at the same time.
	if err := genericcontrollermanager.WaitForAPIServer(versionedClient, 10*time.Second); err != nil {
		return ControllerContext{}, fmt.Errorf("failed to wait for apiserver being healthy: %v", err)
	}

	// Use a discovery client capable of being refreshed.
	discoveryClient := rootClientBuilder.ClientOrDie("controller-discovery")
	cachedClient := cacheddiscovery.NewMemCacheClient(discoveryClient.Discovery())
	restMapper := restmapper.NewDeferredDiscoveryRESTMapper(cachedClient)
	go wait.Until(func() {
		restMapper.Reset()
	}, 30*time.Second, stop)

	availableResources, err := GetAvailableResources(rootClientBuilder)
	if err != nil {
		return ControllerContext{}, err
	}

	cloud, loopMode, err := createCloudProvider(s.ComponentConfig.KubeCloudShared.CloudProvider.Name, s.ComponentConfig.KubeCloudShared.ExternalCloudVolumePlugin,
		s.ComponentConfig.KubeCloudShared.CloudProvider.CloudConfigFile, s.ComponentConfig.KubeCloudShared.AllowUntaggedCloud, sharedInformers)
	if err != nil {
		return ControllerContext{}, err
	}

	ctx := ControllerContext{
		ClientBuilder:      clientBuilder,
		InformerFactory:    sharedInformers,
		ComponentConfig:    s.ComponentConfig,
		RESTMapper:         restMapper,
		AvailableResources: availableResources,
		Cloud:              cloud,
		LoopMode:           loopMode,
		Stop:               stop,
		InformersStarted:   make(chan struct{}),
		ResyncPeriod:       ResyncPeriod(s),
	}
	return ctx, nil
}

The core call is NewSharedInformerFactory:

// Create the SharedInformerFactory.
sharedInformers := informers.NewSharedInformerFactory(versionedClient, ResyncPeriod(s)())
// Store it in the ControllerContext.
ctx := ControllerContext{
	InformerFactory:    sharedInformers,
}

SharedInformerFactory provides shared informers for the common k8s objects:

// SharedInformerFactory provides shared informers for resources in all known
// API group versions.
type SharedInformerFactory interface {
	internalinterfaces.SharedInformerFactory
	ForResource(resource schema.GroupVersionResource) (GenericInformer, error)
	WaitForCacheSync(stopCh <-chan struct{}) map[reflect.Type]bool

	Admissionregistration() admissionregistration.Interface
	Apps() apps.Interface
	Autoscaling() autoscaling.Interface
	Batch() batch.Interface
	Certificates() certificates.Interface
	Coordination() coordination.Interface
	Core() core.Interface
	Events() events.Interface
	Extensions() extensions.Interface
	Networking() networking.Interface
	Policy() policy.Interface
	Rbac() rbac.Interface
	Scheduling() scheduling.Interface
	Settings() settings.Interface
	Storage() storage.Interface
}

3.2. NewControllerInitializers

NewControllerInitializers maps each controller name to its start function, for example deployment, statefulset, replicaset, replicationcontroller, and namespace.

// NewControllerInitializers is a public map of named controller groups (you can start more than one in an init func)
// paired to their InitFunc.  This allows for structured downstream composition and subdivision.
func NewControllerInitializers(loopMode ControllerLoopMode) map[string]InitFunc {
	controllers := map[string]InitFunc{}
	controllers["endpoint"] = startEndpointController
	controllers["replicationcontroller"] = startReplicationController
	controllers["podgc"] = startPodGCController
	controllers["resourcequota"] = startResourceQuotaController
	controllers["namespace"] = startNamespaceController
	controllers["serviceaccount"] = startServiceAccountController
	controllers["garbagecollector"] = startGarbageCollectorController
	controllers["daemonset"] = startDaemonSetController
	controllers["job"] = startJobController
	controllers["deployment"] = startDeploymentController
	controllers["replicaset"] = startReplicaSetController
	controllers["horizontalpodautoscaling"] = startHPAController
	controllers["disruption"] = startDisruptionController
	controllers["statefulset"] = startStatefulSetController
	controllers["cronjob"] = startCronJobController
	controllers["csrsigning"] = startCSRSigningController
	controllers["csrapproving"] = startCSRApprovingController
	controllers["csrcleaner"] = startCSRCleanerController
	controllers["ttl"] = startTTLController
	controllers["bootstrapsigner"] = startBootstrapSignerController
	controllers["tokencleaner"] = startTokenCleanerController
	controllers["nodeipam"] = startNodeIpamController
	if loopMode == IncludeCloudLoops {
		controllers["service"] = startServiceController
		controllers["route"] = startRouteController
		// TODO: volume controller into the IncludeCloudLoops only set.
		// TODO: Separate cluster in cloud check from node lifecycle controller.
	}
	controllers["nodelifecycle"] = startNodeLifecycleController
	controllers["persistentvolume-binder"] = startPersistentVolumeBinderController
	controllers["attachdetach"] = startAttachDetachController
	controllers["persistentvolume-expander"] = startVolumeExpandController
	controllers["clusterrole-aggregation"] = startClusterRoleAggregrationController
	controllers["pvc-protection"] = startPVCProtectionController
	controllers["pv-protection"] = startPVProtectionController
	controllers["ttl-after-finished"] = startTTLAfterFinishedController

	return controllers
}

3.3. StartControllers

func StartControllers(ctx ControllerContext, startSATokenController InitFunc, controllers map[string]InitFunc, unsecuredMux *mux.PathRecorderMux) error {
	...
	for controllerName, initFn := range controllers {
		if !ctx.IsControllerEnabled(controllerName) {
			glog.Warningf("%q is disabled", controllerName)
			continue
		}
		time.Sleep(wait.Jitter(ctx.ComponentConfig.Generic.ControllerStartInterval.Duration, ControllerStartJitter))

		glog.V(1).Infof("Starting %q", controllerName)
		debugHandler, started, err := initFn(ctx)
		if err != nil {
			glog.Errorf("Error starting %q", controllerName)
			return err
		}
		if !started {
			glog.Warningf("Skipping %q", controllerName)
			continue
		}
		if debugHandler != nil && unsecuredMux != nil {
			basePath := "/debug/controllers/" + controllerName
			unsecuredMux.UnlistedHandle(basePath, http.StripPrefix(basePath, debugHandler))
			unsecuredMux.UnlistedHandlePrefix(basePath+"/", http.StripPrefix(basePath, debugHandler))
		}
		glog.Infof("Started %q", controllerName)
	}

	return nil
}

Core code:

for controllerName, initFn := range controllers {
	debugHandler, started, err := initFn(ctx)
}

This starts each controller. The start functions are the ones registered in NewControllerInitializers, for example:

// deployment
controllers["deployment"] = startDeploymentController
// statefulset
controllers["statefulset"] = startStatefulSetController
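
The registry-plus-loop pattern used by NewControllerInitializers and StartControllers can be sketched without any Kubernetes dependency. In the sketch below, `controllerContext`, `initFunc`, and the three init functions are simplified stand-ins invented for illustration; the real InitFunc also returns a debug handler, and the real loop sleeps with jitter between starts.

```go
package main

import (
	"fmt"
	"sort"
)

// controllerContext is a stand-in for ControllerContext; only the fields
// this sketch needs are modelled.
type controllerContext struct {
	disabled  map[string]bool // controllers switched off by configuration
	available map[string]bool // API resources reported as available
}

// initFunc mirrors the shape of InitFunc, minus the debug handler.
type initFunc func(ctx controllerContext) (started bool, err error)

// newInitializers pairs each controller name with its start function.
func newInitializers() map[string]initFunc {
	return map[string]initFunc{
		"deployment": func(ctx controllerContext) (bool, error) {
			return ctx.available["deployments"], nil
		},
		"statefulset": func(ctx controllerContext) (bool, error) {
			return ctx.available["statefulsets"], nil
		},
		"cronjob": func(ctx controllerContext) (bool, error) {
			return ctx.available["cronjobs"], nil
		},
	}
}

// startControllers runs every enabled init function and returns the sorted
// names of the controllers that reported started == true.
func startControllers(ctx controllerContext, inits map[string]initFunc) ([]string, error) {
	var started []string
	for name, initFn := range inits {
		if ctx.disabled[name] {
			continue // disabled by configuration
		}
		ok, err := initFn(ctx)
		if err != nil {
			return nil, fmt.Errorf("starting %q: %w", name, err)
		}
		if !ok {
			continue // skipped, e.g. the required resource is unavailable
		}
		started = append(started, name)
	}
	sort.Strings(started)
	return started, nil
}

func main() {
	ctx := controllerContext{
		disabled:  map[string]bool{"cronjob": true},
		available: map[string]bool{"deployments": true, "cronjobs": true},
	}
	started, err := startControllers(ctx, newInitializers())
	fmt.Println(started, err)
}
```

Here cronjob is skipped because it is disabled and statefulset because its resource is reported as unavailable, which corresponds to the two `continue` branches in StartControllers.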

3.4. InformerFactory.Start

InformerFactory is in fact a SharedInformerFactory; the implementation lives in the informer machinery of client-go.

controllerContext.InformerFactory.Start(controllerContext.Stop)
close(controllerContext.InformersStarted)

3.4.1. SharedInformerFactory

SharedInformerFactory is the interface definition of an informer factory.

// SharedInformerFactory a small interface to allow for adding an informer without an import cycle
type SharedInformerFactory interface {
	Start(stopCh <-chan struct{})
	InformerFor(obj runtime.Object, newFunc NewInformerFunc) cache.SharedIndexInformer
}

3.4.2. sharedInformerFactory.Start

The Start method starts every informer that has been requested and is not yet running.

// Start initializes all requested informers.
func (f *sharedInformerFactory) Start(stopCh <-chan struct{}) {
	f.lock.Lock()
	defer f.lock.Unlock()

	for informerType, informer := range f.informers {
		if !f.startedInformers[informerType] {
			go informer.Run(stopCh)
			f.startedInformers[informerType] = true
		}
	}
}

3.4.3. sharedIndexInformer.Run

sharedIndexInformer.Run contains the actual run logic of sharedIndexInformer. It is covered in a later, dedicated analysis of the informer mechanism.

func (s *sharedIndexInformer) Run(stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()

	fifo := NewDeltaFIFO(MetaNamespaceKeyFunc, nil, s.indexer)

	cfg := &Config{
		Queue:            fifo,
		ListerWatcher:    s.listerWatcher,
		ObjectType:       s.objectType,
		FullResyncPeriod: s.resyncCheckPeriod,
		RetryOnError:     false,
		ShouldResync:     s.processor.shouldResync,

		Process: s.HandleDeltas,
	}

	func() {
		s.startedLock.Lock()
		defer s.startedLock.Unlock()

		s.controller = New(cfg)
		s.controller.(*controller).clock = s.clock
		s.started = true
	}()

	// Separate stop channel because Processor should be stopped strictly after controller
	processorStopCh := make(chan struct{})
	var wg wait.Group
	defer wg.Wait()              // Wait for Processor to stop
	defer close(processorStopCh) // Tell Processor to stop
	wg.StartWithChannel(processorStopCh, s.cacheMutationDetector.Run)
	wg.StartWithChannel(processorStopCh, s.processor.run)

	defer func() {
		s.startedLock.Lock()
		defer s.startedLock.Unlock()
		s.stopped = true // Don't want any new listeners
	}()
	s.controller.Run(stopCh)
}

4. initFn(ctx)

initFn is the start function of each controller. The code is in kubernetes/cmd/kube-controller-manager/app/apps.go. This article takes startStatefulSetController and startDeploymentController as examples; the logic they call lives in kubernetes/pkg/controller and is analyzed later.

4.1. startStatefulSetController

func startStatefulSetController(ctx ControllerContext) (http.Handler, bool, error) {
	if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "statefulsets"}] {
		return nil, false, nil
	}
	go statefulset.NewStatefulSetController(
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.InformerFactory.Apps().V1().StatefulSets(),
		ctx.InformerFactory.Core().V1().PersistentVolumeClaims(),
		ctx.InformerFactory.Apps().V1().ControllerRevisions(),
		ctx.ClientBuilder.ClientOrDie("statefulset-controller"),
	).Run(1, ctx.Stop)
	return nil, true, nil
}

It uses the InformerFactory to obtain the informers for Pods, StatefulSets, PersistentVolumeClaims, and ControllerRevisions.

startStatefulSetController mainly calls NewStatefulSetController and then the controller's Run function.

4.2. startDeploymentController

func startDeploymentController(ctx ControllerContext) (http.Handler, bool, error) {
	if !ctx.AvailableResources[schema.GroupVersionResource{Group: "apps", Version: "v1", Resource: "deployments"}] {
		return nil, false, nil
	}
	dc, err := deployment.NewDeploymentController(
		ctx.InformerFactory.Apps().V1().Deployments(),
		ctx.InformerFactory.Apps().V1().ReplicaSets(),
		ctx.InformerFactory.Core().V1().Pods(),
		ctx.ClientBuilder.ClientOrDie("deployment-controller"),
	)
	if err != nil {
		return nil, true, fmt.Errorf("error creating Deployment controller: %v", err)
	}
	go dc.Run(int(ctx.ComponentConfig.DeploymentController.ConcurrentDeploymentSyncs), ctx.Stop)
	return nil, true, nil
}

startDeploymentController mainly calls NewDeploymentController and then the controller's Run function. That logic is in kubernetes/pkg/controller.

5. Summary

  1. kube-controller-manager follows the same Cobra command-line structure as the other components: it constructs the ControllerManagerCommand and then calls command.Execute(). The basic flow is to construct the options, add the flags, and execute Run.
  2. The call flow of the cmd part is: Main-->NewControllerManagerCommand-->Run(c.Complete(), wait.NeverStop)-->StartControllers-->initFn(ctx)-->startDeploymentController/startStatefulSetController-->sts.NewStatefulSetController.Run/dc.NewDeploymentController.Run-->pkg/controller
  3. CreateControllerContext creates the context used by every controller, and NewControllerInitializers registers the start functions of the controllers, including DeploymentController and StatefulSetController.

The basic flow is:

  1. Construct the controller manager options, convert them into a Config object, and execute Run.
  2. Create the ControllerContext from the Config object; it contains the InformerFactory.
  3. Run the controllers with the ControllerContext; the controllers are defined in NewControllerInitializers.
  4. Execute InformerFactory.Start.
  5. Each controller constructs its own struct and executes the corresponding Run function.

11.3.3 -

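kube-controller-manager Source Code Analysis (3): The Informer Mechanism

The following analysis is based on kubernetes v1.12.0.

This article analyzes the Informer mechanism (List-Watch) that the core k8s components use frequently. The code lives mainly in client-go, which is vendored into the kubernetes tree as a third-party package.

Most of the logic is in the /vendor/k8s.io/client-go/tools/cache package, whose directory structure is as follows: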

cache
├── controller.go  # contains: Config, Run, processLoop, NewInformer, NewIndexerInformer
├── delta_fifo.go  # contains: NewDeltaFIFO, DeltaFIFO, AddIfNotPresent
├── expiration_cache.go
├── expiration_cache_fakes.go
├── fake_custom_store.go
├── fifo.go   # contains: Queue, FIFO, NewFIFO
├── heap.go
├── index.go    # contains: Indexer, MetaNamespaceIndexFunc
├── listers.go
├── listwatch.go   # contains: ListerWatcher, ListWatch, List, Watch
├── mutation_cache.go
├── mutation_detector.go
├── reflector.go   # contains: Reflector, NewReflector, Run, ListAndWatch
├── reflector_metrics.go
├── shared_informer.go  # contains: NewSharedInformer, WaitForCacheSync, Run, HasSynced
├── store.go  # contains: Store, MetaNamespaceKeyFunc, SplitMetaNamespaceKey
├── testing
│   └── fake_controller_source.go
├── thread_safe_store.go  # contains: ThreadSafeStore, threadSafeMap
└── undelta_store.go

0. Schematic diagrams

Diagram 1

Diagram 2

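0.1. client-go components

  • Reflector: a reflector watches a specific k8s API resource, which can be a built-in resource or a custom resource. It does this in the ListAndWatch method. When the reflector is notified through the watch API that a new resource instance exists, it gets the newly created object using the corresponding listing API and puts it in the Delta FIFO queue inside the watchHandler function.

  • Informer: an informer pops objects from the Delta FIFO queue. The function that does this is processLoop. The job of this base controller is to save the object for later retrieval and to invoke our controller, passing it the object.

  • Indexer: an indexer provides indexing over objects. A typical use case is an index based on object labels. An Indexer can maintain indexes based on several indexing functions, and it uses a thread-safe data store to store objects and their keys. The Store defines a default function named MetaNamespaceKeyFunc that generates an object's key as the <namespace>/<name> combination for that object.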
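
To make the key format and the idea of an index function concrete, here is a minimal sketch. The `object` type and the helpers `keyFor`, `splitKey`, and `indexByLabel` are hypothetical stand-ins written for this note; they imitate the behaviour described for MetaNamespaceKeyFunc and SplitMetaNamespaceKey but take a plain struct instead of a real API object.

```go
package main

import (
	"fmt"
	"strings"
)

// object is a hypothetical stand-in for the metadata of a k8s object.
type object struct {
	Namespace string
	Name      string
	Labels    map[string]string
}

// keyFor builds "<namespace>/<name>", or just "<name>" when the object
// has no namespace (cluster-scoped objects).
func keyFor(o object) string {
	if o.Namespace == "" {
		return o.Name
	}
	return o.Namespace + "/" + o.Name
}

// splitKey is the inverse of keyFor.
func splitKey(key string) (namespace, name string, err error) {
	parts := strings.Split(key, "/")
	switch len(parts) {
	case 1:
		return "", parts[0], nil
	case 2:
		return parts[0], parts[1], nil
	}
	return "", "", fmt.Errorf("unexpected key format: %q", key)
}

// indexByLabel groups object keys by the value of one label, a simple
// example of an index built from an index function.
func indexByLabel(objs []object, label string) map[string][]string {
	index := map[string][]string{}
	for _, o := range objs {
		if v, ok := o.Labels[label]; ok {
			index[v] = append(index[v], keyFor(o))
		}
	}
	return index
}

func main() {
	objs := []object{
		{Namespace: "default", Name: "web-0", Labels: map[string]string{"app": "web"}},
		{Namespace: "default", Name: "web-1", Labels: map[string]string{"app": "web"}},
		{Name: "node-a"},
	}
	fmt.Println(keyFor(objs[0]), keyFor(objs[2]))
	fmt.Println(indexByLabel(objs, "app")["web"])
}
```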

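0.2. Custom controller components

  • Informer reference: the reference to the Informer instance that knows how to work with the custom resource objects. The custom controller code needs to create the appropriate Informer.

  • Indexer reference: the custom controller's reference to the Indexer instance. The custom controller code needs to create the appropriate Indexer.

client-go provides the NewIndexerInformer function, which creates both the Informer and the Indexer.

  • Resource Event Handlers: the callback functions that the Informer calls when it wants to deliver an object to the controller. The typical pattern is to obtain the dispatched object's key and enqueue that key in a work queue for further processing.

  • Work queue: the task queue. The resource event handlers extract the key of the delivered object and add it to this queue.

  • Process Item: the function that processes items from the work queue. It typically uses the Indexer reference, or a Listing wrapper, to retrieve the object that corresponds to the key.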
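
The hand-off from event handler to work queue to process item can be sketched in a few lines. This is a single-goroutine illustration under simplifying assumptions: `workQueue` is a hypothetical de-duplicating FIFO rather than the client-go workqueue package, and the `store` map plays the role of the Indexer. In a real controller the informer updates the Indexer and the handler only enqueues the key; the sketch folds both into one closure for brevity.

```go
package main

import "fmt"

// workQueue is a minimal FIFO of keys that ignores a key already waiting.
// It is not safe for concurrent use.
type workQueue struct {
	pending map[string]bool
	order   []string
}

func newWorkQueue() *workQueue {
	return &workQueue{pending: map[string]bool{}}
}

// add enqueues key unless it is already queued.
func (q *workQueue) add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// get removes and returns the oldest key, or false when the queue is empty.
func (q *workQueue) get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// processAll feeds three events through the handler, then drains the queue
// and looks each key up in the store, as a Process Item function would.
func processAll() []string {
	store := map[string]string{} // stands in for the Indexer
	q := newWorkQueue()

	// onEvent stands in for a resource event handler.
	onEvent := func(key, state string) {
		store[key] = state
		q.add(key)
	}
	onEvent("default/web", "replicas=1")
	onEvent("default/db", "replicas=1")
	onEvent("default/web", "replicas=3") // same key: not queued twice

	var handled []string
	for {
		key, ok := q.get()
		if !ok {
			break
		}
		handled = append(handled, key+": "+store[key])
	}
	return handled
}

func main() {
	for _, line := range processAll() {
		fmt.Println(line)
	}
}
```

Queuing keys instead of objects is what lets repeated events for the same object collapse into one unit of work: the processing step always reads the latest state from the store.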

1. sharedInformerFactory.Start

The Run function of controller-manager calls InformerFactory.Start.

This code is in /cmd/kube-controller-manager/app/controllermanager.go.

// Run runs the KubeControllerManagerOptions.  This should never exit.
func Run(c *config.CompletedConfig, stopCh <-chan struct{}) error {
    ...
		controllerContext.InformerFactory.Start(controllerContext.Stop)
		close(controllerContext.InformersStarted)
    ...
}

InformerFactory is a SharedInformerFactory interface, defined as follows:

This code is in vendor/k8s.io/client-go/informers/internalinterfaces/factory_interfaces.go.

// SharedInformerFactory a small interface to allow for adding an informer without an import cycle
type SharedInformerFactory interface {
	Start(stopCh <-chan struct{})
	InformerFor(obj runtime.Object, newFunc NewInformerFunc) cache.SharedIndexInformer
}

The Start method starts every requested informer that is not yet running, launching one informer.Run goroutine per informer type.

This code is in vendor/k8s.io/client-go/informers/factory.go.

// Start initializes all requested informers.
func (f *sharedInformerFactory) Start(stopCh <-chan struct{}) {
	f.lock.Lock()
	defer f.lock.Unlock()

	for informerType, informer := range f.informers {
		if !f.startedInformers[informerType] {
			go informer.Run(stopCh)
			f.startedInformers[informerType] = true
		}
	}
}
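
The important property of Start is that it is idempotent: calling it again only launches informers registered since the previous call. The sketch below models that with a map of run functions guarded by a mutex. The `factory` type, `register`, and the WaitGroup used for a clean shutdown are assumptions made for this illustration and are not part of the quoted client-go code.

```go
package main

import (
	"fmt"
	"sync"
)

// runFunc stands in for an informer's Run method.
type runFunc func(stop <-chan struct{})

type factory struct {
	mu        sync.Mutex
	wg        sync.WaitGroup // added here only so the demo can shut down cleanly
	informers map[string]runFunc
	started   map[string]bool
}

func newFactory() *factory {
	return &factory{informers: map[string]runFunc{}, started: map[string]bool{}}
}

// register records an informer under a type name without starting it.
func (f *factory) register(name string, run runFunc) {
	f.mu.Lock()
	defer f.mu.Unlock()
	f.informers[name] = run
}

// start launches one goroutine per informer that is not yet running and
// returns how many goroutines it launched.
func (f *factory) start(stop <-chan struct{}) int {
	f.mu.Lock()
	defer f.mu.Unlock()

	launched := 0
	for name, run := range f.informers {
		if f.started[name] {
			continue
		}
		f.wg.Add(1)
		go func(run runFunc) {
			defer f.wg.Done()
			run(stop)
		}(run)
		f.started[name] = true
		launched++
	}
	return launched
}

// demoStart reports how many goroutines three successive start calls launch.
func demoStart() (first, second, third int) {
	f := newFactory()
	stop := make(chan struct{})
	block := func(stop <-chan struct{}) { <-stop }

	f.register("pods", block)
	f.register("deployments", block)
	first = f.start(stop)  // both informers start
	second = f.start(stop) // nothing new to start
	f.register("statefulsets", block)
	third = f.start(stop) // only the new informer starts

	close(stop)
	f.wg.Wait()
	return first, second, third
}

func main() {
	first, second, third := demoStart()
	fmt.Println(first, second, third)
}
```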

2. sharedIndexInformer.Run

This code is in /vendor/k8s.io/client-go/tools/cache/shared_informer.go.

func (s *sharedIndexInformer) Run(stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()

	fifo := NewDeltaFIFO(MetaNamespaceKeyFunc, nil, s.indexer)

	cfg := &Config{
		Queue:            fifo,
		ListerWatcher:    s.listerWatcher,
		ObjectType:       s.objectType,
		FullResyncPeriod: s.resyncCheckPeriod,
		RetryOnError:     false,
		ShouldResync:     s.processor.shouldResync,

		Process: s.HandleDeltas,
	}

	func() {
		s.startedLock.Lock()
		defer s.startedLock.Unlock()

		s.controller = New(cfg)
		s.controller.(*controller).clock = s.clock
		s.started = true
	}()

	// Separate stop channel because Processor should be stopped strictly after controller
	processorStopCh := make(chan struct{})
	var wg wait.Group
	defer wg.Wait()              // Wait for Processor to stop
	defer close(processorStopCh) // Tell Processor to stop
	wg.StartWithChannel(processorStopCh, s.cacheMutationDetector.Run)
	wg.StartWithChannel(processorStopCh, s.processor.run)

	defer func() {
		s.startedLock.Lock()
		defer s.startedLock.Unlock()
		s.stopped = true // Don't want any new listeners
	}()
	s.controller.Run(stopCh)
}

2.1. NewDeltaFIFO

DeltaFIFO is a first-in, first-out queue that stores object changes (deltas). The process function consumes the objects returned by the queue's Pop method.

fifo := NewDeltaFIFO(MetaNamespaceKeyFunc, nil, s.indexer)
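
As a rough mental model of such a queue, the sketch below keeps a list of deltas per key plus a FIFO of keys, so that popping a key returns every change accumulated for it. The `deltaFIFO` type here is a simplified assumption for illustration: unlike the real DeltaFIFO it is not thread-safe, its pop does not block, and it knows nothing about known objects or resync.

```go
package main

import "fmt"

type deltaType string

const (
	added   deltaType = "Added"
	updated deltaType = "Updated"
	deleted deltaType = "Deleted"
)

// delta is one change to one object.
type delta struct {
	Type   deltaType
	Object string
}

// deltaFIFO accumulates deltas per key and hands keys out in arrival order.
type deltaFIFO struct {
	items map[string][]delta
	queue []string
}

func newDeltaFIFO() *deltaFIFO {
	return &deltaFIFO{items: map[string][]delta{}}
}

// push appends a delta for key; the key is queued only on its first delta.
func (f *deltaFIFO) push(key string, d delta) {
	if _, exists := f.items[key]; !exists {
		f.queue = append(f.queue, key)
	}
	f.items[key] = append(f.items[key], d)
}

// pop returns the oldest key with all of its accumulated deltas.
func (f *deltaFIFO) pop() (string, []delta, bool) {
	if len(f.queue) == 0 {
		return "", nil, false
	}
	key := f.queue[0]
	f.queue = f.queue[1:]
	deltas := f.items[key]
	delete(f.items, key)
	return key, deltas, true
}

func main() {
	f := newDeltaFIFO()
	f.push("default/a", delta{added, "a@v1"})
	f.push("default/b", delta{added, "b@v1"})
	f.push("default/a", delta{updated, "a@v2"})

	for {
		key, deltas, ok := f.pop()
		if !ok {
			break
		}
		fmt.Println(key, deltas)
	}
}
```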

2.2. Config

Build the controller's Config and set its Process field to HandleDeltas, the process function used later.

cfg := &Config{
	Queue:            fifo,
	ListerWatcher:    s.listerWatcher,
	ObjectType:       s.objectType,
	FullResyncPeriod: s.resyncCheckPeriod,
	RetryOnError:     false,
	ShouldResync:     s.processor.shouldResync,

	Process: s.HandleDeltas,
}

2.3. controller

Call New(cfg) to build the controller of the sharedIndexInformer.

func() {
	s.startedLock.Lock()
	defer s.startedLock.Unlock()

	s.controller = New(cfg)
	s.controller.(*controller).clock = s.clock
	s.started = true
}()

2.4. cacheMutationDetector.Run

Call s.cacheMutationDetector.Run, which periodically checks whether cached objects have been mutated.

wg.StartWithChannel(processorStopCh, s.cacheMutationDetector.Run)

defaultCacheMutationDetector.Run

func (d *defaultCacheMutationDetector) Run(stopCh <-chan struct{}) {
	// we DON'T want protection from panics.  If we're running this code, we want to die
	for {
		d.CompareObjects()

		select {
		case <-stopCh:
			return
		case <-time.After(d.period):
		}
	}
}

CompareObjects

func (d *defaultCacheMutationDetector) CompareObjects() {
	d.lock.Lock()
	defer d.lock.Unlock()

	altered := false
	for i, obj := range d.cachedObjs {
		if !reflect.DeepEqual(obj.cached, obj.copied) {
			fmt.Printf("CACHE %s[%d] ALTERED!\n%v\n", d.name, i, diff.ObjectDiff(obj.cached, obj.copied))
			altered = true
		}
	}

	if altered {
		msg := fmt.Sprintf("cache %s modified", d.name)
		if d.failureFunc != nil {
			d.failureFunc(msg)
			return
		}
		panic(msg)
	}
}
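
The detection technique is to keep a private deep copy of each object next to the cached pointer and compare the two later with reflect.DeepEqual. The sketch below shows that idea in isolation; the `pod`, `record`, and `detector` types are hypothetical, and instead of panicking it returns the indexes of the altered entries.

```go
package main

import (
	"fmt"
	"reflect"
)

// pod is a hypothetical cached object.
type pod struct {
	Name   string
	Labels map[string]string
}

// deepCopy returns an independent copy, preserving a nil Labels map.
func (p *pod) deepCopy() *pod {
	out := &pod{Name: p.Name}
	if p.Labels != nil {
		out.Labels = make(map[string]string, len(p.Labels))
		for k, v := range p.Labels {
			out.Labels[k] = v
		}
	}
	return out
}

// record pairs the shared cached object with a private snapshot of it.
type record struct {
	cached *pod
	copied *pod
}

type detector struct {
	records []record
}

// add remembers the object together with a deep copy taken right now.
func (d *detector) add(p *pod) {
	d.records = append(d.records, record{cached: p, copied: p.deepCopy()})
}

// altered returns the indexes of cached objects that no longer match
// their snapshot, i.e. objects that some consumer mutated in place.
func (d *detector) altered() []int {
	var idx []int
	for i, r := range d.records {
		if !reflect.DeepEqual(r.cached, r.copied) {
			idx = append(idx, i)
		}
	}
	return idx
}

func main() {
	d := &detector{}
	web := &pod{Name: "web-0", Labels: map[string]string{"app": "web"}}
	db := &pod{Name: "db-0", Labels: map[string]string{"app": "db"}}
	d.add(web)
	d.add(db)

	fmt.Println(len(d.altered())) // nothing has been mutated yet

	web.Labels["app"] = "changed" // a consumer mutates the shared object
	fmt.Println(d.altered())
}
```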

2.5. processor.run

s.processor.run invokes sharedProcessor.run, which starts listener.run and listener.pop for every listener to deliver the queued notifications.

wg.StartWithChannel(processorStopCh, s.processor.run)

sharedProcessor.run

func (p *sharedProcessor) run(stopCh <-chan struct{}) {
	func() {
		p.listenersLock.RLock()
		defer p.listenersLock.RUnlock()
		for _, listener := range p.listeners {
			p.wg.Start(listener.run)
			p.wg.Start(listener.pop)
		}
	}()
	<-stopCh
	p.listenersLock.RLock()
	defer p.listenersLock.RUnlock()
	for _, listener := range p.listeners {
		close(listener.addCh) // Tell .pop() to stop. .pop() will tell .run() to stop
	}
	p.wg.Wait() // Wait for all .pop() and .run() to stop
}

This logic is analyzed later.
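
Until then, the general shape of a pop-style goroutine can be illustrated with a generic pattern: a goroutine sits between an input channel and an output channel and buffers items in memory, so that the producer never waits for a slow consumer. The `pump` function below is a sketch of that pattern written for this note, not the client-go implementation; in particular, flushing the buffer when the input closes is a choice made here to keep the demo deterministic.

```go
package main

import "fmt"

// pump forwards values from addCh to nextCh, buffering without bound so
// that senders on addCh are never blocked by the reader of nextCh.
func pump(addCh <-chan string, nextCh chan<- string) {
	defer close(nextCh)

	var pending []string
	for {
		// A nil channel disables its select case, so the send case is
		// only active while something is waiting to be delivered.
		var out chan<- string
		var head string
		if len(pending) > 0 {
			out = nextCh
			head = pending[0]
		}

		select {
		case out <- head:
			pending = pending[1:]
		case v, ok := <-addCh:
			if !ok {
				// Input closed: deliver what is still buffered, then stop.
				for _, p := range pending {
					nextCh <- p
				}
				return
			}
			pending = append(pending, v)
		}
	}
}

// demoPump sends five events before anyone reads, then drains the output.
func demoPump() []string {
	addCh := make(chan string)
	nextCh := make(chan string)
	go pump(addCh, nextCh)

	for i := 0; i < 5; i++ {
		addCh <- fmt.Sprintf("event-%d", i) // does not wait for a consumer
	}
	close(addCh)

	var got []string
	for v := range nextCh {
		got = append(got, v)
	}
	return got
}

func main() {
	fmt.Println(demoPump())
}
```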

2.6. controller.Run

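Call s.controller.Run, which builds the Reflector and keeps a local cache of the objects served by the apiserver (and ultimately stored in etcd).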

defer func() {
	s.startedLock.Lock()
	defer s.startedLock.Unlock()
	s.stopped = true // Don't want any new listeners
}()
s.controller.Run(stopCh)

controller.Run

This code is in /vendor/k8s.io/client-go/tools/cache/controller.go.

// Run begins processing items, and will continue until a value is sent down stopCh.
// It's an error to call Run more than once.
// Run blocks; call via go.
func (c *controller) Run(stopCh <-chan struct{}) {
	defer utilruntime.HandleCrash()
	go func() {
		<-stopCh
		c.config.Queue.Close()
	}()
	r := NewReflector(
		c.config.ListerWatcher,
		c.config.ObjectType,
		c.config.Queue,
		c.config.FullResyncPeriod,
	)
	r.ShouldResync = c.config.ShouldResync
	r.clock = c.clock

	c.reflectorMutex.Lock()
	c.reflector = r
	c.reflectorMutex.Unlock()

	var wg wait.Group
	defer wg.Wait()

	wg.StartWithChannel(stopCh, r.Run)

	wait.Until(c.processLoop, time.Second, stopCh)
}

Core code:

// Build the Reflector.
r := NewReflector(
	c.config.ListerWatcher,
	c.config.ObjectType,
	c.config.Queue,
	c.config.FullResyncPeriod,
)
// Run the Reflector.
wg.StartWithChannel(stopCh, r.Run)
// Run processLoop.
wait.Until(c.processLoop, time.Second, stopCh)

3. Reflector

3.1. Reflector

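The Reflector watches a specified k8s resource and reflects every change into the local store. It only puts objects of type expectedType into the store, unless expectedType is nil. If resyncPeriod is non-zero, the Reflector also triggers a full pass over all objects every resyncPeriod, so it can be used both to process everything periodically and to process changed objects incrementally.

Commonly used fields:

  • expectedType: the type of object expected to be placed in the store.
  • store: the local cache that is kept in sync with the watched resource.
  • listerWatcher: the interface used to list and watch.
  • period: the wait between one watch ending and the next one beginning; NewNamedReflector sets it to 1 second.
  • resyncPeriod: the resync period; when non-zero, a resync is triggered at this interval.
  • lastSyncResourceVersion: the resource version last observed when syncing; it is mainly used when watching.
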
// Reflector watches a specified resource and causes all changes to be reflected in the given store.
type Reflector struct {
	// name identifies this reflector. By default it will be a file:line if possible.
	name string
	// metrics tracks basic metric information about the reflector
	metrics *reflectorMetrics

	// The type of object we expect to place in the store.
	expectedType reflect.Type
	// The destination to sync up with the watch source
	store Store
	// listerWatcher is used to perform lists and watches.
	listerWatcher ListerWatcher
	// period controls timing between one watch ending and
	// the beginning of the next one.
	period       time.Duration
	resyncPeriod time.Duration
	ShouldResync func() bool
	// clock allows tests to manipulate time
	clock clock.Clock
	// lastSyncResourceVersion is the resource version token last
	// observed when doing a sync with the underlying store
	// it is thread safe, but not synchronized with the underlying store
	lastSyncResourceVersion string
	// lastSyncResourceVersionMutex guards read/write access to lastSyncResourceVersion
	lastSyncResourceVersionMutex sync.RWMutex
}

3.2. NewReflector

NewReflector constructs the Reflector struct.

This code is in /vendor/k8s.io/client-go/tools/cache/reflector.go.

// NewReflector creates a new Reflector object which will keep the given store up to
// date with the server's contents for the given resource. Reflector promises to
// only put things in the store that have the type of expectedType, unless expectedType
// is nil. If resyncPeriod is non-zero, then lists will be executed after every
// resyncPeriod, so that you can use reflectors to periodically process everything as
// well as incrementally processing the things that change.
func NewReflector(lw ListerWatcher, expectedType interface{}, store Store, resyncPeriod time.Duration) *Reflector {
	return NewNamedReflector(getDefaultReflectorName(internalPackages...), lw, expectedType, store, resyncPeriod)
}

// reflectorDisambiguator is used to disambiguate started reflectors.
// initialized to an unstable value to ensure meaning isn't attributed to the suffix.
var reflectorDisambiguator = int64(time.Now().UnixNano() % 12345)

// NewNamedReflector same as NewReflector, but with a specified name for logging
func NewNamedReflector(name string, lw ListerWatcher, expectedType interface{}, store Store, resyncPeriod time.Duration) *Reflector {
	reflectorSuffix := atomic.AddInt64(&reflectorDisambiguator, 1)
	r := &Reflector{
		name: name,
		// we need this to be unique per process (some names are still the same)but obvious who it belongs to
		metrics:       newReflectorMetrics(makeValidPromethusMetricLabel(fmt.Sprintf("reflector_"+name+"_%d", reflectorSuffix))),
		listerWatcher: lw,
		store:         store,
		expectedType:  reflect.TypeOf(expectedType),
		period:        time.Second,
		resyncPeriod:  resyncPeriod,
		clock:         &clock.RealClock{},
	}
	return r
}

3.3. Reflector.Run

Reflector.Run mainly calls ListAndWatch, and calls it again after each return until stopCh is closed.

// Run starts a watch and handles watch events. Will restart the watch if it is closed.
// Run will exit when stopCh is closed.
func (r *Reflector) Run(stopCh <-chan struct{}) {
	glog.V(3).Infof("Starting reflector %v (%s) from %s", r.expectedType, r.resyncPeriod, r.name)
	wait.Until(func() {
		if err := r.ListAndWatch(stopCh); err != nil {
			utilruntime.HandleError(err)
		}
	}, r.period, stopCh)
}
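
The retry behaviour comes from wait.Until, which calls a function repeatedly with a pause in between until the stop channel is closed. A simplified standard-library equivalent looks roughly like the sketch below; `until` and `runThreeTimes` are names invented here, and the real helper in k8s.io/apimachinery has more options (jitter, sliding periods) than this sketch models.

```go
package main

import (
	"fmt"
	"time"
)

// until calls f, waits for period, and repeats until stop is closed.
func until(f func(), period time.Duration, stop <-chan struct{}) {
	for {
		// Check stop first so a closed channel always wins over the timer.
		select {
		case <-stop:
			return
		default:
		}

		f()

		select {
		case <-stop:
			return
		case <-time.After(period):
		}
	}
}

// runThreeTimes lets the callback close stop on its third invocation and
// returns how many times the callback ran.
func runThreeTimes() int {
	stop := make(chan struct{})
	calls := 0
	until(func() {
		calls++
		if calls == 3 {
			close(stop)
		}
	}, time.Millisecond, stop)
	return calls
}

func main() {
	fmt.Println(runThreeTimes())
}
```

In the Reflector, the callback is ListAndWatch: whenever it returns, whether cleanly or with an error, the loop waits for r.period and starts a fresh list and watch.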

3.4. ListAndWatch

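ListAndWatch first lists all objects and records the resource version of that list, then watches for changes starting from that resource version. The list request sets the resource version to "0", so it may be served from a cache and lag behind the contents of etcd; the subsequent watch delivers the missing changes, so the local cache eventually becomes consistent with etcd.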
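
The list-then-watch protocol can be modelled with an in-memory server that numbers every change. In the sketch below, `fakeServer`, `syncFromList`, and `applyWatch` are hypothetical helpers written for illustration: the list step replaces the whole local store and returns the version of the snapshot, and the watch step replays only the changes made after that version.

```go
package main

import "fmt"

// event is one change, stamped with the resource version it produced.
type event struct {
	Type  string // "ADDED", "MODIFIED" or "DELETED"
	Key   string
	Value string
	RV    int
}

// fakeServer is an in-memory stand-in for the API server.
type fakeServer struct {
	rv      int
	objects map[string]string
	history []event
}

func newFakeServer() *fakeServer {
	return &fakeServer{objects: map[string]string{}}
}

// apply performs a change and records it under a new resource version.
func (s *fakeServer) apply(typ, key, value string) {
	s.rv++
	if typ == "DELETED" {
		delete(s.objects, key)
	} else {
		s.objects[key] = value
	}
	s.history = append(s.history, event{Type: typ, Key: key, Value: value, RV: s.rv})
}

// list returns a snapshot of all objects and the current resource version.
func (s *fakeServer) list() (map[string]string, int) {
	snapshot := make(map[string]string, len(s.objects))
	for k, v := range s.objects {
		snapshot[k] = v
	}
	return snapshot, s.rv
}

// watchSince returns every change made after the given resource version.
func (s *fakeServer) watchSince(rv int) []event {
	var out []event
	for _, ev := range s.history {
		if ev.RV > rv {
			out = append(out, ev)
		}
	}
	return out
}

// syncFromList replaces the store with a fresh list and returns its version.
func syncFromList(s *fakeServer, store map[string]string) int {
	snapshot, rv := s.list()
	for k := range store {
		delete(store, k)
	}
	for k, v := range snapshot {
		store[k] = v
	}
	return rv
}

// applyWatch replays the changes after rv into the store and returns the
// resource version of the last event seen.
func applyWatch(s *fakeServer, store map[string]string, rv int) int {
	for _, ev := range s.watchSince(rv) {
		if ev.Type == "DELETED" {
			delete(store, ev.Key)
		} else {
			store[ev.Key] = ev.Value
		}
		rv = ev.RV
	}
	return rv
}

// demoListWatch lists, lets the server change, then catches up via watch.
func demoListWatch() (store map[string]string, listRV, watchRV int) {
	s := newFakeServer()
	s.apply("ADDED", "default/a", "v1")
	s.apply("ADDED", "default/b", "v1")

	store = map[string]string{}
	listRV = syncFromList(s, store)

	s.apply("MODIFIED", "default/a", "v2")
	s.apply("DELETED", "default/b", "")
	s.apply("ADDED", "default/c", "v1")

	watchRV = applyWatch(s, store, listRV)
	return store, listRV, watchRV
}

func main() {
	store, listRV, watchRV := demoListWatch()
	fmt.Println(store, listRV, watchRV)
}
```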

3.4.1. List

// ListAndWatch first lists all items and get the resource version at the moment of call,
// and then use the resource version to watch.
// It returns error if ListAndWatch didn't even try to initialize watch.
func (r *Reflector) ListAndWatch(stopCh <-chan struct{}) error {
	glog.V(3).Infof("Listing and watching %v from %s", r.expectedType, r.name)
	var resourceVersion string

	// Explicitly set "0" as resource version - it's fine for the List()
	// to be served from cache and potentially be delayed relative to
	// etcd contents. Reflector framework will catch up via Watch() eventually.
	options := metav1.ListOptions{ResourceVersion: "0"}
	r.metrics.numberOfLists.Inc()
	start := r.clock.Now()
	list, err := r.listerWatcher.List(options)
	if err != nil {
		return fmt.Errorf("%s: Failed to list %v: %v", r.name, r.expectedType, err)
	}
	r.metrics.listDuration.Observe(time.Since(start).Seconds())
	listMetaInterface, err := meta.ListAccessor(list)
	if err != nil {
		return fmt.Errorf("%s: Unable to understand list result %#v: %v", r.name, list, err)
	}
	resourceVersion = listMetaInterface.GetResourceVersion()
	items, err := meta.ExtractList(list)
	if err != nil {
		return fmt.Errorf("%s: Unable to understand list result %#v (%v)", r.name, list, err)
	}
	r.metrics.numberOfItemsInList.Observe(float64(len(items)))
	if err := r.syncWith(items, resourceVersion); err != nil {
		return fmt.Errorf("%s: Unable to sync list result: %v", r.name, err)
	}
	r.setLastSyncResourceVersion(resourceVersion)
    ...
}    

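First the resource version is set to "0", then listerWatcher.List(options) is called to list all objects.

// Set the resource version to "0".
options := metav1.ListOptions{ResourceVersion: "0"}
// Call the list interface.
list, err := r.listerWatcher.List(options)

Get the resource version of the list and extract the list contents into a slice of objects.

// Get the resource version.
resourceVersion = listMetaInterface.GetResourceVersion()
// Extract the list contents into a slice of objects.
items, err := meta.ExtractList(list)

Store the objects and the resource version in the local store, replacing its entire previous contents.

err := r.syncWith(items, resourceVersion)

syncWith calls the store's Replace method to replace the data previously held in the store.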

// syncWith replaces the store's items with the given list.
func (r *Reflector) syncWith(items []runtime.Object, resourceVersion string) error {
	found := make([]interface{}, 0, len(items))
	for _, item := range items {
		found = append(found, item)
	}
	return r.store.Replace(found, resourceVersion)
}

Store.Replace is defined as follows:

type Store interface {
	...
	// Replace will delete the contents of the store, using instead the
	// given list. Store takes ownership of the list, you should not reference
	// it after calling this function.
	Replace([]interface{}, string) error
    ...
}

Finally, record the latest resource version.

r.setLastSyncResourceVersion(resourceVersion)

setLastSyncResourceVersion:

func (r *Reflector) setLastSyncResourceVersion(v string) {
	r.lastSyncResourceVersionMutex.Lock()
	defer r.lastSyncResourceVersionMutex.Unlock()
	r.lastSyncResourceVersion = v

	rv, err := strconv.Atoi(v)
	if err == nil {
		r.metrics.lastResourceVersion.Set(float64(rv))
	}
}

3.4.2. store.Resync

resyncerrc := make(chan error, 1)
cancelCh := make(chan struct{})
defer close(cancelCh)
go func() {
	resyncCh, cleanup := r.resyncChan()
	defer func() {
		cleanup() // Call the last one written into cleanup
	}()
	for {
		select {
		case <-resyncCh:
		case <-stopCh:
			return
		case <-cancelCh:
			return
		}
		if r.ShouldResync == nil || r.ShouldResync() {
			glog.V(4).Infof("%s: forcing resync", r.name)
			if err := r.store.Resync(); err != nil {
				resyncerrc <- err
				return
			}
		}
		cleanup()
		resyncCh, cleanup = r.resyncChan()
	}
}()

Core code:

err := r.store.Resync()

The concrete store here is a DeltaFIFO, so this calls DeltaFIFO.Resync:

// Resync will send a sync event for each item
func (f *DeltaFIFO) Resync() error {
	f.lock.Lock()
	defer f.lock.Unlock()

	if f.knownObjects == nil {
		return nil
	}

	keys := f.knownObjects.ListKeys()
	for _, k := range keys {
		if err := f.syncKeyLocked(k); err != nil {
			return err
		}
	}
	return nil
}

3.4.3. Watch

for {
	// give the stopCh a chance to stop the loop, even in case of continue statements further down on errors
	select {
	case <-stopCh:
		return nil
	default:
	}

	timemoutseconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0))
	options = metav1.ListOptions{
		ResourceVersion: resourceVersion,
		// We want to avoid situations of hanging watchers. Stop any wachers that do not
		// receive any events within the timeout window.
		TimeoutSeconds: &timemoutseconds,
	}

	r.metrics.numberOfWatches.Inc()
	w, err := r.listerWatcher.Watch(options)
	if err != nil {
		switch err {
		case io.EOF:
			// watch closed normally
		case io.ErrUnexpectedEOF:
			glog.V(1).Infof("%s: Watch for %v closed with unexpected EOF: %v", r.name, r.expectedType, err)
		default:
			utilruntime.HandleError(fmt.Errorf("%s: Failed to watch %v: %v", r.name, r.expectedType, err))
		}
		// If this is "connection refused" error, it means that most likely apiserver is not responsive.
		// It doesn't make sense to re-list all objects because most likely we will be able to restart
		// watch where we ended.
		// If that's the case wait and resend watch request.
		if urlError, ok := err.(*url.Error); ok {
			if opError, ok := urlError.Err.(*net.OpError); ok {
				if errno, ok := opError.Err.(syscall.Errno); ok && errno == syscall.ECONNREFUSED {
					time.Sleep(time.Second)
					continue
				}
			}
		}
		return nil
	}

	if err := r.watchHandler(w, &resourceVersion, resyncerrc, stopCh); err != nil {
		if err != errorStopRequested {
			glog.Warningf("%s: watch of %v ended with: %v", r.name, r.expectedType, err)
		}
		return nil
	}
}

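Set the watch timeout. minWatchTimeout is 5 minutes, and the random factor in [1.0, 2.0) spreads the actual timeout between 5 and 10 minutes.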

timemoutseconds := int64(minWatchTimeout.Seconds() * (rand.Float64() + 1.0))
options = metav1.ListOptions{
	ResourceVersion: resourceVersion,
	// We want to avoid situations of hanging watchers. Stop any wachers that do not
	// receive any events within the timeout window.
	TimeoutSeconds: &timemoutseconds,
}
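The arithmetic behind the jittered timeout can be checked in isolation. The sketch below assumes minWatchTimeout is 5 minutes, as stated above. watchTimeoutSeconds is a hypothetical helper that takes the random factor as a parameter so the bounds are easy to see.

```go
package main

import (
	"fmt"
	"math/rand"
	"time"
)

// minWatchTimeout mirrors the 5-minute lower bound described in the text.
const minWatchTimeout = 5 * time.Minute

// watchTimeoutSeconds scales the minimum by a factor in [1.0, 2.0).
// r is expected to be in [0.0, 1.0), as returned by rand.Float64.
func watchTimeoutSeconds(r float64) int64 {
	return int64(minWatchTimeout.Seconds() * (r + 1.0))
}

func main() {
	fmt.Println(watchTimeoutSeconds(0))              // lower bound
	fmt.Println(watchTimeoutSeconds(0.5))            // midpoint
	fmt.Println(watchTimeoutSeconds(rand.Float64())) // what a Reflector-like caller would use
}
```

Randomizing the timeout keeps many watchers from reconnecting to the apiserver at the same moment.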

Call listerWatcher.Watch(options).

w, err := r.listerWatcher.Watch(options)

Run watchHandler.

err := r.watchHandler(w, &resourceVersion, resyncerrc, stopCh)

3.4.4. watchHandler

watchHandler consumes the watch events and keeps the current resource version up to date.

// watchHandler watches w and keeps *resourceVersion up to date.
func (r *Reflector) watchHandler(w watch.Interface, resourceVersion *string, errc chan error, stopCh <-chan struct{}) error {
	start := r.clock.Now()
	eventCount := 0

	// Stopping the watcher should be idempotent and if we return from this function there's no way
	// we're coming back in with the same watch interface.
	defer w.Stop()
	// update metrics
	defer func() {
		r.metrics.numberOfItemsInWatch.Observe(float64(eventCount))
		r.metrics.watchDuration.Observe(time.Since(start).Seconds())
	}()

loop:
	for {
		select {
		case <-stopCh:
			return errorStopRequested
		case err := <-errc:
			return err
		case event, ok := <-w.ResultChan():
			if !ok {
				break loop
			}
			if event.Type == watch.Error {
				return apierrs.FromObject(event.Object)
			}
			if e, a := r.expectedType, reflect.TypeOf(event.Object); e != nil && e != a {
				utilruntime.HandleError(fmt.Errorf("%s: expected type %v, but watch event object had type %v", r.name, e, a))
				continue
			}
			meta, err := meta.Accessor(event.Object)
			if err != nil {
				utilruntime.HandleError(fmt.Errorf("%s: unable to understand watch event %#v", r.name, event))
				continue
			}
			newResourceVersion := meta.GetResourceVersion()
			switch event.Type {
			case watch.Added:
				err := r.store.Add(event.Object)
				if err != nil {
					utilruntime.HandleError(fmt.Errorf("%s: unable to add watch event object (%#v) to store: %v", r.name, event.Object, err))
				}
			case watch.Modified:
				err := r.store.Update(event.Object)
				if err != nil {
					utilruntime.HandleError(fmt.Errorf("%s: unable to update watch event object (%#v) to store: %v", r.name, event.Object, err))
				}
			case watch.Deleted:
				// TODO: Will any consumers need access to the "last known
				// state", which is passed in event.Object? If so, may need
				// to change this.
				err := r.store.Delete(event.Object)
				if err != nil {
					utilruntime.HandleError(fmt.Errorf("%s: unable to delete watch event object (%#v) from store: %v", r.name, event.Object, err))
				}
			default:
				utilruntime.HandleError(fmt.Errorf("%s: unable to understand watch event %#v", r.name, event))
			}
			*resourceVersion = newResourceVersion
			r.setLastSyncResourceVersion(newResourceVersion)
			eventCount++
		}
	}

	watchDuration := r.clock.Now().Sub(start)
	if watchDuration < 1*time.Second && eventCount == 0 {
		r.metrics.numberOfShortWatches.Inc()
		return fmt.Errorf("very short watch: %s: Unexpected watch close - watch lasted less than a second and no items received", r.name)
	}
	glog.V(4).Infof("%s: Watch close - %v total %v items received", r.name, r.expectedType, eventCount)
	return nil
}

Events are read from the channel returned by the watch interface's ResultChan():

for {
	select {
	...
	case event, ok := <-w.ResultChan():
		...
	}
}

When an Added, Modified, or Deleted event arrives, the corresponding object is applied to the local store.

switch event.Type {
case watch.Added:
	err := r.store.Add(event.Object)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("%s: unable to add watch event object (%#v) to store: %v", r.name, event.Object, err))
	}
case watch.Modified:
	err := r.store.Update(event.Object)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("%s: unable to update watch event object (%#v) to store: %v", r.name, event.Object, err))
	}
case watch.Deleted:
	// TODO: Will any consumers need access to the "last known
	// state", which is passed in event.Object? If so, may need
	// to change this.
	err := r.store.Delete(event.Object)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("%s: unable to delete watch event object (%#v) from store: %v", r.name, event.Object, err))
	}
default:
	utilruntime.HandleError(fmt.Errorf("%s: unable to understand watch event %#v", r.name, event))
}

Update the latest resource version.

newResourceVersion := meta.GetResourceVersion()
*resourceVersion = newResourceVersion
r.setLastSyncResourceVersion(newResourceVersion)
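The loop described above can be reduced to a small, self-contained sketch: apply each event to a local store and remember the last resource version seen. The event type and applyEvents function are hypothetical simplifications. A plain map stands in for the store, and string values stand in for API objects.

```go
package main

import "fmt"

type eventType string

const (
	added    eventType = "ADDED"
	modified eventType = "MODIFIED"
	deleted  eventType = "DELETED"
)

// event is a hypothetical, flattened watch event.
type event struct {
	Type            eventType
	Key             string
	Value           string
	ResourceVersion string
}

// applyEvents drains ch into store and returns the last resource version seen.
func applyEvents(ch <-chan event, store map[string]string, startVersion string) string {
	version := startVersion
	for ev := range ch {
		switch ev.Type {
		case added, modified:
			store[ev.Key] = ev.Value
		case deleted:
			delete(store, ev.Key)
		default:
			fmt.Printf("unable to understand watch event %#v\n", ev)
		}
		// The version advances for every event, so a later watch can resume here.
		version = ev.ResourceVersion
	}
	return version
}

// demo feeds three events through applyEvents.
func demo() (map[string]string, string) {
	ch := make(chan event, 3)
	ch <- event{added, "default/a", "v1", "101"}
	ch <- event{modified, "default/a", "v2", "102"}
	ch <- event{added, "default/b", "v1", "103"}
	close(ch)
	store := map[string]string{}
	version := applyEvents(ch, store, "100")
	return store, version
}

func main() {
	store, version := demo()
	fmt.Println(store, version)
}
```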

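The Reflector analysis above makes repeated use of the local store. The store that the Reflector writes to is in fact the DeltaFIFO assigned to it, so the following sections look at DeltaFIFO and store.

4. DeltaFIFO

DeltaFIFO is created by NewDeltaFIFO and assigned to config.Queue.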

func (s *sharedIndexInformer) Run(stopCh <-chan struct{}) {
	fifo := NewDeltaFIFO(MetaNamespaceKeyFunc, nil, s.indexer)

	cfg := &Config{
		Queue:            fifo,
		...
	}
	...
}

4.1. NewDeltaFIFO

// NewDeltaFIFO returns a Store which can be used process changes to items.
//
// keyFunc is used to figure out what key an object should have. (It's
// exposed in the returned DeltaFIFO's KeyOf() method, with bonus features.)
//
// 'compressor' may compress as many or as few items as it wants
// (including returning an empty slice), but it should do what it
// does quickly since it is called while the queue is locked.
// 'compressor' may be nil if you don't want any delta compression.
//
// 'keyLister' is expected to return a list of keys that the consumer of
// this queue "knows about". It is used to decide which items are missing
// when Replace() is called; 'Deleted' deltas are produced for these items.
// It may be nil if you don't need to detect all deletions.
// TODO: consider merging keyLister with this object, tracking a list of
//       "known" keys when Pop() is called. Have to think about how that
//       affects error retrying.
// TODO(lavalamp): I believe there is a possible race only when using an
//                 external known object source that the above TODO would
//                 fix.
//
// Also see the comment on DeltaFIFO.
func NewDeltaFIFO(keyFunc KeyFunc, compressor DeltaCompressor, knownObjects KeyListerGetter) *DeltaFIFO {
	f := &DeltaFIFO{
		items:           map[string]Deltas{},
		queue:           []string{},
		keyFunc:         keyFunc,
		deltaCompressor: compressor,
		knownObjects:    knownObjects,
	}
	f.cond.L = &f.lock
	return f
}

controller.Run calls NewReflector.

func (c *controller) Run(stopCh <-chan struct{}) {
	...
	r := NewReflector(
		c.config.ListerWatcher,
		c.config.ObjectType,
		c.config.Queue,
		c.config.FullResyncPeriod,
	)
	...
}

The NewReflector constructor assigns c.config.Queue to the Reflector.store field.

func NewReflector(lw ListerWatcher, expectedType interface{}, store Store, resyncPeriod time.Duration) *Reflector {
	return NewNamedReflector(getDefaultReflectorName(internalPackages...), lw, expectedType, store, resyncPeriod)
}

// NewNamedReflector same as NewReflector, but with a specified name for logging
func NewNamedReflector(name string, lw ListerWatcher, expectedType interface{}, store Store, resyncPeriod time.Duration) *Reflector {
	reflectorSuffix := atomic.AddInt64(&reflectorDisambiguator, 1)
	r := &Reflector{
		name: name,
		// we need this to be unique per process (some names are still the same)but obvious who it belongs to
		metrics:       newReflectorMetrics(makeValidPromethusMetricLabel(fmt.Sprintf("reflector_"+name+"_%d", reflectorSuffix))),
		listerWatcher: lw,
		store:         store,
		expectedType:  reflect.TypeOf(expectedType),
		period:        time.Second,
		resyncPeriod:  resyncPeriod,
		clock:         &clock.RealClock{},
	}
	return r
}

4.2. DeltaFIFO

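DeltaFIFO is a producer-consumer queue: the Reflector is the producer, and the consumer is whatever calls Pop().

DeltaFIFO is intended for the following use cases:

  • You want to process every object change (delta) at most once.
  • When you process an object, you want to see everything that has happened to it since you last processed it.
  • You want to process the deletion of objects.
  • You might want to periodically reprocess objects.
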
// DeltaFIFO is like FIFO, but allows you to process deletes.
//
// DeltaFIFO is a producer-consumer queue, where a Reflector is
// intended to be the producer, and the consumer is whatever calls
// the Pop() method.
//
// DeltaFIFO solves this use case:
//  * You want to process every object change (delta) at most once.
//  * When you process an object, you want to see everything
//    that's happened to it since you last processed it.
//  * You want to process the deletion of objects.
//  * You might want to periodically reprocess objects.
//
// DeltaFIFO's Pop(), Get(), and GetByKey() methods return
// interface{} to satisfy the Store/Queue interfaces, but it
// will always return an object of type Deltas.
//
// A note on threading: If you call Pop() in parallel from multiple
// threads, you could end up with multiple threads processing slightly
// different versions of the same object.
//
// A note on the KeyLister used by the DeltaFIFO: It's main purpose is
// to list keys that are "known", for the purpose of figuring out which
// items have been deleted when Replace() or Delete() are called. The deleted
// object will be included in the DeleteFinalStateUnknown markers. These objects
// could be stale.
//
// You may provide a function to compress deltas (e.g., represent a
// series of Updates as a single Update).
type DeltaFIFO struct {
	// lock/cond protects access to 'items' and 'queue'.
	lock sync.RWMutex
	cond sync.Cond

	// We depend on the property that items in the set are in
	// the queue and vice versa, and that all Deltas in this
	// map have at least one Delta.
	items map[string]Deltas
	queue []string

	// populated is true if the first batch of items inserted by Replace() has been populated
	// or Delete/Add/Update was called first.
	populated bool
	// initialPopulationCount is the number of items inserted by the first call of Replace()
	initialPopulationCount int

	// keyFunc is used to make the key used for queued item
	// insertion and retrieval, and should be deterministic.
	keyFunc KeyFunc

	// deltaCompressor tells us how to combine two or more
	// deltas. It may be nil.
	deltaCompressor DeltaCompressor

	// knownObjects list keys that are "known", for the
	// purpose of figuring out which items have been deleted
	// when Replace() or Delete() is called.
	knownObjects KeyListerGetter

	// Indication the queue is closed.
	// Used to indicate a queue is closed so a control loop can exit when a queue is empty.
	// Currently, not used to gate any of CRED operations.
	closed     bool
	closedLock sync.Mutex
}
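The relationship between the items map and the queue slice is easier to see in a reduced form. The sketch below is a hypothetical, non-thread-safe deltaQueue, not the real DeltaFIFO. It only shows that deltas for the same key accumulate under one queue entry and are popped together.

```go
package main

import "fmt"

type deltaType string

// delta is a simplified change record; Object is a string for brevity.
type delta struct {
	Type   deltaType
	Object string
}

// deltaQueue is an illustrative reduction of the items/queue pair.
type deltaQueue struct {
	items map[string][]delta // all pending deltas per key
	queue []string           // keys in first-in, first-out order
}

func newDeltaQueue() *deltaQueue {
	return &deltaQueue{items: map[string][]delta{}}
}

// push appends a delta; the key enters the FIFO order only the first time.
func (q *deltaQueue) push(key string, d delta) {
	if _, queued := q.items[key]; !queued {
		q.queue = append(q.queue, key)
	}
	q.items[key] = append(q.items[key], d)
}

// pop returns every delta accumulated for the oldest key.
func (q *deltaQueue) pop() (string, []delta, bool) {
	if len(q.queue) == 0 {
		return "", nil, false
	}
	key := q.queue[0]
	q.queue = q.queue[1:]
	ds := q.items[key]
	delete(q.items, key)
	return key, ds, true
}

func main() {
	q := newDeltaQueue()
	q.push("default/a", delta{"Added", "a-v1"})
	q.push("default/b", delta{"Added", "b-v1"})
	q.push("default/a", delta{"Updated", "a-v2"})
	key, ds, _ := q.pop()
	fmt.Println(key, ds)
}
```

Because a key appears in the queue at most once, the consumer sees each object's full history since it was last processed in a single Pop.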

4.3. Queue & Store

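DeltaFIFO implements the Queue interface, while Reflector.store is declared as the Store interface. Queue is a Store that also has a Pop method; the Process function handles each object returned by Queue.Pop.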

// Queue is exactly like a Store, but has a Pop() method too.
type Queue interface {
	Store

	// Pop blocks until it has something to process.
	// It returns the object that was process and the result of processing.
	// The PopProcessFunc may return an ErrRequeue{...} to indicate the item
	// should be requeued before releasing the lock on the queue.
	Pop(PopProcessFunc) (interface{}, error)

	// AddIfNotPresent adds a value previously
	// returned by Pop back into the queue as long
	// as nothing else (presumably more recent)
	// has since been added.
	AddIfNotPresent(interface{}) error

	// Return true if the first batch of items has been popped
	HasSynced() bool

	// Close queue
	Close()
}

5. store

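Store is a generic object storage interface. The Reflector watches the server and updates a store. Depending on the Store implementation, the Reflector can act as a local cache or work like a queue of items yet to be processed.

A Store implementation is responsible for keying objects correctly and for defining how objects are retrieved by key.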

// Store is a generic object storage interface. Reflector knows how to watch a server
// and update a store. A generic store is provided, which allows Reflector to be used
// as a local caching system, and an LRU store, which allows Reflector to work like a
// queue of items yet to be processed.
//
// Store makes no assumptions about stored object identity; it is the responsibility
// of a Store implementation to provide a mechanism to correctly key objects and to
// define the contract for obtaining objects by some arbitrary key type.
type Store interface {
	Add(obj interface{}) error
	Update(obj interface{}) error
	Delete(obj interface{}) error
	List() []interface{}
	ListKeys() []string
	Get(obj interface{}) (item interface{}, exists bool, err error)
	GetByKey(key string) (item interface{}, exists bool, err error)

	// Replace will delete the contents of the store, using instead the
	// given list. Store takes ownership of the list, you should not reference
	// it after calling this function.
	Replace([]interface{}, string) error
	Resync() error
}

Replace deletes the existing contents of the store and stores the given list instead; that is, it replaces the data completely.

5.1. cache

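cache implements the Store interface and delegates the actual storage to the ThreadSafeStore interface.

cache has two responsibilities:

  • Computing keys for objects via keyFunc
  • Invoking the methods of a ThreadSafeStorage interface
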
// cache responsibilities are limited to:
//	1. Computing keys for objects via keyFunc
//  2. Invoking methods of a ThreadSafeStorage interface
type cache struct {
	// cacheStorage bears the burden of thread safety for the cache
	cacheStorage ThreadSafeStore
	// keyFunc is used to make the key for objects stored in and retrieved from items, and
	// should be deterministic.
	keyFunc KeyFunc
}

ListAndWatch mainly uses the following methods:

cache.Replace

// Replace will delete the contents of 'c', using instead the given list.
// 'c' takes ownership of the list, you should not reference the list again
// after calling this function.
func (c *cache) Replace(list []interface{}, resourceVersion string) error {
	items := map[string]interface{}{}
	for _, item := range list {
		key, err := c.keyFunc(item)
		if err != nil {
			return KeyError{item, err}
		}
		items[key] = item
	}
	c.cacheStorage.Replace(items, resourceVersion)
	return nil
}

cache.Add

// Add inserts an item into the cache.
func (c *cache) Add(obj interface{}) error {
	key, err := c.keyFunc(obj)
	if err != nil {
		return KeyError{obj, err}
	}
	c.cacheStorage.Add(key, obj)
	return nil
}

cache.Update

// Update sets an item in the cache to its updated state.
func (c *cache) Update(obj interface{}) error {
	key, err := c.keyFunc(obj)
	if err != nil {
		return KeyError{obj, err}
	}
	c.cacheStorage.Update(key, obj)
	return nil
}

cache.Delete

// Delete removes an item from the cache.
func (c *cache) Delete(obj interface{}) error {
	key, err := c.keyFunc(obj)
	if err != nil {
		return KeyError{obj, err}
	}
	c.cacheStorage.Delete(key)
	return nil
}
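The methods above share one pattern: compute a key with keyFunc, then delegate to the storage. The sketch below reproduces that pattern with hypothetical types (object, keyedCache) and a plain map as storage. namespaceNameKey is loosely modeled on the "namespace/name" convention and is not the real MetaNamespaceKeyFunc.

```go
package main

import (
	"errors"
	"fmt"
)

// object is a hypothetical resource with just enough metadata to build a key.
type object struct {
	Namespace, Name string
}

// namespaceNameKey builds "<namespace>/<name>", or "<name>" when there is no namespace.
func namespaceNameKey(o object) (string, error) {
	if o.Name == "" {
		return "", errors.New("object has no name")
	}
	if o.Namespace == "" {
		return o.Name, nil
	}
	return o.Namespace + "/" + o.Name, nil
}

// keyedCache computes keys and delegates storage to a map.
type keyedCache struct {
	keyFunc func(object) (string, error)
	storage map[string]object
}

func newKeyedCache() *keyedCache {
	return &keyedCache{keyFunc: namespaceNameKey, storage: map[string]object{}}
}

// Add stores the object under its computed key.
func (c *keyedCache) Add(o object) error {
	key, err := c.keyFunc(o)
	if err != nil {
		return fmt.Errorf("couldn't create key for %+v: %w", o, err)
	}
	c.storage[key] = o
	return nil
}

// Replace discards the current contents and stores the given list instead.
func (c *keyedCache) Replace(list []object) error {
	items := make(map[string]object, len(list))
	for _, o := range list {
		key, err := c.keyFunc(o)
		if err != nil {
			return fmt.Errorf("couldn't create key for %+v: %w", o, err)
		}
		items[key] = o
	}
	c.storage = items
	return nil
}

func main() {
	c := newKeyedCache()
	_ = c.Add(object{Namespace: "default", Name: "web"})
	_ = c.Replace([]object{{Name: "node-1"}})
	fmt.Println(c.storage)
}
```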

5.2. ThreadSafeStore

The cache methods are implemented by calling ThreadSafeStore.

// ThreadSafeStore is an interface that allows concurrent access to a storage backend.
// TL;DR caveats: you must not modify anything returned by Get or List as it will break
// the indexing feature in addition to not being thread safe.
//
// The guarantees of thread safety provided by List/Get are only valid if the caller
// treats returned items as read-only. For example, a pointer inserted in the store
// through `Add` will be returned as is by `Get`. Multiple clients might invoke `Get`
// on the same key and modify the pointer in a non-thread-safe way. Also note that
// modifying objects stored by the indexers (if any) will *not* automatically lead
// to a re-index. So it's not a good idea to directly modify the objects returned by
// Get/List, in general.
type ThreadSafeStore interface {
	Add(key string, obj interface{})
	Update(key string, obj interface{})
	Delete(key string)
	Get(key string) (item interface{}, exists bool)
	List() []interface{}
	ListKeys() []string
	Replace(map[string]interface{}, string)
	Index(indexName string, obj interface{}) ([]interface{}, error)
	IndexKeys(indexName, indexKey string) ([]string, error)
	ListIndexFuncValues(name string) []string
	ByIndex(indexName, indexKey string) ([]interface{}, error)
	GetIndexers() Indexers

	// AddIndexers adds more indexers to this store.  If you call this after you already have data
	// in the store, the results are undefined.
	AddIndexers(newIndexers Indexers) error
	Resync() error
}

threadSafeMap

// threadSafeMap implements ThreadSafeStore
type threadSafeMap struct {
	lock  sync.RWMutex
	items map[string]interface{}

	// indexers maps a name to an IndexFunc
	indexers Indexers
	// indices maps a name to an Index
	indices Indices
}
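The locking discipline implied by the lock field can be sketched without the indexing parts. safeMap below is a hypothetical reduction: writers take the write lock, readers take the read lock. Indexers and indices are left out on purpose.

```go
package main

import (
	"fmt"
	"sync"
)

// safeMap is an illustrative subset of a thread-safe store.
type safeMap struct {
	lock  sync.RWMutex
	items map[string]interface{}
}

func newSafeMap() *safeMap {
	return &safeMap{items: map[string]interface{}{}}
}

// Add takes the write lock because it mutates the map.
func (m *safeMap) Add(key string, obj interface{}) {
	m.lock.Lock()
	defer m.lock.Unlock()
	m.items[key] = obj
}

// Get takes the read lock, so concurrent readers do not block each other.
func (m *safeMap) Get(key string) (interface{}, bool) {
	m.lock.RLock()
	defer m.lock.RUnlock()
	obj, ok := m.items[key]
	return obj, ok
}

// ListKeys returns a snapshot of the keys.
func (m *safeMap) ListKeys() []string {
	m.lock.RLock()
	defer m.lock.RUnlock()
	keys := make([]string, 0, len(m.items))
	for k := range m.items {
		keys = append(keys, k)
	}
	return keys
}

// fillConcurrently writes n entries from n goroutines.
func fillConcurrently(m *safeMap, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			m.Add(fmt.Sprintf("key-%d", i), i)
		}(i)
	}
	wg.Wait()
}

func main() {
	m := newSafeMap()
	fillConcurrently(m, 50)
	fmt.Println(len(m.ListKeys()))
}
```

As the interface comment warns, the lock only protects the map itself. Objects returned by Get are shared, so callers should treat them as read-only.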

6. processLoop

func (c *controller) Run(stopCh <-chan struct{}) {
	...
	wait.Until(c.processLoop, time.Second, stopCh)
}

controller.Run calls processLoop. Its logic is analyzed below.

// processLoop drains the work queue.
// TODO: Consider doing the processing in parallel. This will require a little thought
// to make sure that we don't end up processing the same object multiple times
// concurrently.
//
// TODO: Plumb through the stopCh here (and down to the queue) so that this can
// actually exit when the controller is stopped. Or just give up on this stuff
// ever being stoppable. Converting this whole package to use Context would
// also be helpful.
func (c *controller) processLoop() {
	for {
		obj, err := c.config.Queue.Pop(PopProcessFunc(c.config.Process))
		if err != nil {
			if err == FIFOClosedError {
				return
			}
			if c.config.RetryOnError {
				// This is the safe way to re-enqueue.
				c.config.Queue.AddIfNotPresent(obj)
			}
		}
	}
}

processLoop drains the work queue; each item is handled by the configured ProcessFunc. The core code is:

obj, err := c.config.Queue.Pop(PopProcessFunc(c.config.Process))

6.1. DeltaFIFO.Pop

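Pop blocks until an item is added to the queue. If several items are ready, they are returned in first-in, first-out order. If the process function returns ErrRequeue, the item is put back into the queue.

Pop calls the supplied process function to handle the item.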

// Pop blocks until an item is added to the queue, and then returns it.  If
// multiple items are ready, they are returned in the order in which they were
// added/updated. The item is removed from the queue (and the store) before it
// is returned, so if you don't successfully process it, you need to add it back
// with AddIfNotPresent().
// process function is called under lock, so it is safe update data structures
// in it that need to be in sync with the queue (e.g. knownKeys). The PopProcessFunc
// may return an instance of ErrRequeue with a nested error to indicate the current
// item should be requeued (equivalent to calling AddIfNotPresent under the lock).
//
// Pop returns a 'Deltas', which has a complete list of all the things
// that happened to the object (deltas) while it was sitting in the queue.
func (f *DeltaFIFO) Pop(process PopProcessFunc) (interface{}, error) {
	f.lock.Lock()
	defer f.lock.Unlock()
	for {
		for len(f.queue) == 0 {
			// When the queue is empty, invocation of Pop() is blocked until new item is enqueued.
			// When Close() is called, the f.closed is set and the condition is broadcasted.
			// Which causes this loop to continue and return from the Pop().
			if f.IsClosed() {
				return nil, FIFOClosedError
			}

			f.cond.Wait()
		}
		id := f.queue[0]
		f.queue = f.queue[1:]
		item, ok := f.items[id]
		if f.initialPopulationCount > 0 {
			f.initialPopulationCount--
		}
		if !ok {
			// Item may have been deleted subsequently.
			continue
		}
		delete(f.items, id)
		err := process(item)
		if e, ok := err.(ErrRequeue); ok {
			f.addIfNotPresent(id, item)
			err = e.Err
		}
		// Don't need to copyDeltas here, because we're transferring
		// ownership to the caller.
		return item, err
	}
}

Core code:

for {
	...
	item, ok := f.items[id]
	...
	err := process(item)
	if e, ok := err.(ErrRequeue); ok {
		f.addIfNotPresent(id, item)
		err = e.Err
	}
	// Don't need to copyDeltas here, because we're transferring
	// ownership to the caller.
	return item, err
}
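The blocking and requeue behaviour can be sketched with a sync.Cond. blockingQueue and errRequeue below are hypothetical names. Unlike the real Pop, this version stores plain strings and returns the requeue error unchanged instead of unwrapping a nested error.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// errRequeue is a hypothetical marker error asking pop to put the item back.
var errRequeue = errors.New("requeue")

type blockingQueue struct {
	lock  sync.Mutex
	cond  *sync.Cond
	queue []string
}

func newBlockingQueue() *blockingQueue {
	q := &blockingQueue{}
	q.cond = sync.NewCond(&q.lock)
	return q
}

// add appends an item and wakes up any goroutine blocked in pop.
func (q *blockingQueue) add(item string) {
	q.lock.Lock()
	defer q.lock.Unlock()
	q.queue = append(q.queue, item)
	q.cond.Broadcast()
}

// pop blocks until an item is available, then runs process under the lock.
func (q *blockingQueue) pop(process func(string) error) (string, error) {
	q.lock.Lock()
	defer q.lock.Unlock()
	for len(q.queue) == 0 {
		q.cond.Wait() // releases the lock while waiting
	}
	item := q.queue[0]
	q.queue = q.queue[1:]
	err := process(item)
	if errors.Is(err, errRequeue) {
		q.queue = append(q.queue, item)
	}
	return item, err
}

// len reports the number of queued items.
func (q *blockingQueue) len() int {
	q.lock.Lock()
	defer q.lock.Unlock()
	return len(q.queue)
}

func main() {
	q := newBlockingQueue()
	go q.add("default/a") // the producer runs concurrently with the blocked pop
	item, err := q.pop(func(string) error { return errRequeue })
	fmt.Println(item, err, q.len())
}
```

Running process while the lock is held is what makes the requeue safe: no other producer can slip a newer version of the item in between.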

6.2. HandleDeltas

cfg := &Config{
	Queue:            fifo,
	ListerWatcher:    s.listerWatcher,
	ObjectType:       s.objectType,
	FullResyncPeriod: s.resyncCheckPeriod,
	RetryOnError:     false,
	ShouldResync:     s.processor.shouldResync,

	Process: s.HandleDeltas,
}

The process function is HandleDeltas, which sharedIndexInformer.Run assigns to config.Process.

func (s *sharedIndexInformer) HandleDeltas(obj interface{}) error {
	s.blockDeltas.Lock()
	defer s.blockDeltas.Unlock()

	// from oldest to newest
	for _, d := range obj.(Deltas) {
		switch d.Type {
		case Sync, Added, Updated:
			isSync := d.Type == Sync
			s.cacheMutationDetector.AddObject(d.Object)
			if old, exists, err := s.indexer.Get(d.Object); err == nil && exists {
				if err := s.indexer.Update(d.Object); err != nil {
					return err
				}
				s.processor.distribute(updateNotification{oldObj: old, newObj: d.Object}, isSync)
			} else {
				if err := s.indexer.Add(d.Object); err != nil {
					return err
				}
				s.processor.distribute(addNotification{newObj: d.Object}, isSync)
			}
		case Deleted:
			if err := s.indexer.Delete(d.Object); err != nil {
				return err
			}
			s.processor.distribute(deleteNotification{oldObj: d.Object}, false)
		}
	}
	return nil
}

Core code:

switch d.Type {
case Sync, Added, Updated:
	...
	if old, exists, err := s.indexer.Get(d.Object); err == nil && exists {
		...
		s.processor.distribute(updateNotification{oldObj: old, newObj: d.Object}, isSync)
	} else {
		...
		s.processor.distribute(addNotification{newObj: d.Object}, isSync)
	}
case Deleted:
	...
	s.processor.distribute(deleteNotification{oldObj: d.Object}, false)
}

Depending on the delta type, HandleDeltas calls processor.distribute, which sends the notification into each processorListener's channel.
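The decision between an add and an update notification depends only on whether the indexer already holds the object. A minimal sketch of that decision follows. The notification type and handleDelta function are hypothetical, and a map of strings stands in for the indexer.

```go
package main

import "fmt"

// notification is a hypothetical, flattened form of the three notification types.
type notification struct {
	Kind   string // "add", "update" or "delete"
	OldObj string
	NewObj string
}

// handleDelta updates the indexer and returns the notification to distribute.
func handleDelta(indexer map[string]string, deltaType, key, obj string) notification {
	switch deltaType {
	case "Sync", "Added", "Updated":
		old, exists := indexer[key]
		indexer[key] = obj
		if exists {
			// Already known: listeners see an update with the previous state.
			return notification{Kind: "update", OldObj: old, NewObj: obj}
		}
		return notification{Kind: "add", NewObj: obj}
	case "Deleted":
		delete(indexer, key)
		return notification{Kind: "delete", OldObj: obj}
	}
	return notification{}
}

func main() {
	indexer := map[string]string{}
	fmt.Println(handleDelta(indexer, "Added", "default/a", "v1").Kind)
	fmt.Println(handleDelta(indexer, "Added", "default/a", "v2").Kind)
	fmt.Println(handleDelta(indexer, "Deleted", "default/a", "v2").Kind)
}
```

Note that an Added delta for an object already in the indexer is reported as an update, which is why handlers cannot rely on the delta type alone.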

6.3. sharedProcessor.distribute

func (p *sharedProcessor) distribute(obj interface{}, sync bool) {
	p.listenersLock.RLock()
	defer p.listenersLock.RUnlock()

	if sync {
		for _, listener := range p.syncingListeners {
			listener.add(obj)
		}
	} else {
		for _, listener := range p.listeners {
			listener.add(obj)
		}
	}
}

processorListener.add:

func (p *processorListener) add(notification interface{}) {
	p.addCh <- notification
}

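Putting this together: processLoop calls HandleDeltas, which calls distribute, which calls processorListener.add. The notifications for each kind of change end up in the processorListener's channel, where processorListener.run consumes them. processorListener.run is analyzed next.

7. processor

The processor keeps track of all registered callbacks (the ResourceEventHandler instances) and is responsible for invoking them. sharedIndexInformer.Run calls processor.run.

Flow:

  1. The listener's add function hands the notification over through addCh so that it can be buffered in pendingNotifications.
  2. pop takes the first notification from pendingNotifications and sends it to the nextCh channel.
  3. run receives the notification and, depending on its type (add, delete, update), invokes the matching handler function. These handlers are registered in the various NewXxxController constructors.
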
func (s *sharedIndexInformer) Run(stopCh <-chan struct{}) {
	...
	wg.StartWithChannel(processorStopCh, s.processor.run)
	...
}

7.1. sharedProcessor.run

func (p *sharedProcessor) run(stopCh <-chan struct{}) {
   func() {
      p.listenersLock.RLock()
      defer p.listenersLock.RUnlock()
      for _, listener := range p.listeners {
         p.wg.Start(listener.run)
         p.wg.Start(listener.pop)
      }
   }()
   <-stopCh
   p.listenersLock.RLock()
   defer p.listenersLock.RUnlock()
   for _, listener := range p.listeners {
      close(listener.addCh) // Tell .pop() to stop. .pop() will tell .run() to stop
   }
   p.wg.Wait() // Wait for all .pop() and .run() to stop
}

7.1.1. listener.pop

pop takes the first notification from pendingNotifications and sends it to the nextCh channel.

func (p *processorListener) pop() {
	defer utilruntime.HandleCrash()
	defer close(p.nextCh) // Tell .run() to stop

	var nextCh chan<- interface{}
	var notification interface{}
	for {
		select {
		case nextCh <- notification:
			// Notification dispatched
			var ok bool
			notification, ok = p.pendingNotifications.ReadOne()
			if !ok { // Nothing to pop
				nextCh = nil // Disable this select case
			}
		case notificationToAdd, ok := <-p.addCh:
			if !ok {
				return
			}
			if notification == nil { // No notification to pop (and pendingNotifications is empty)
				// Optimize the case - skip adding to pendingNotifications
				notification = notificationToAdd
				nextCh = p.nextCh
			} else { // There is already a notification waiting to be dispatched
				p.pendingNotifications.WriteOne(notificationToAdd)
			}
		}
	}
}
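The pop function relies on a Go idiom: a send on a nil channel never proceeds, so setting nextCh to nil disables that select case. The sketch below isolates the idiom. relay and relayDemo are hypothetical names, and a slice stands in for pendingNotifications.

```go
package main

import "fmt"

// relay forwards values from in to out, buffering in a slice while the
// receiver on out is busy, so a sender on in is never blocked for long.
func relay(in <-chan int, out chan<- int) {
	defer close(out)
	var pending []int
	var next chan<- int // nil until there is something to send
	var head int
	for {
		select {
		case next <- head:
			if len(pending) == 0 {
				next = nil // nothing left: disable this case
			} else {
				head, pending = pending[0], pending[1:]
			}
		case v, ok := <-in:
			if !ok {
				return
			}
			if next == nil {
				// Nothing waiting: skip the buffer and offer v directly.
				head, next = v, out
			} else {
				pending = append(pending, v)
			}
		}
	}
}

// relayDemo sends n values before reading any of them back.
func relayDemo(n int) []int {
	in := make(chan int)
	out := make(chan int)
	go relay(in, out)
	for i := 1; i <= n; i++ {
		in <- i // accepted even though nobody is reading from out yet
	}
	got := make([]int, 0, n)
	for i := 0; i < n; i++ {
		got = append(got, <-out)
	}
	close(in) // relay returns and closes out
	<-out     // wait for the close
	return got
}

func main() {
	fmt.Println(relayDemo(5))
}
```

The effect is an unbounded buffer between producer and consumer: a slow handler delays its own notifications but does not stall distribute.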

7.1.2. listener.run

listener.run calls a different handler function depending on the notification type.

func (p *processorListener) run() {
	defer utilruntime.HandleCrash()

	for next := range p.nextCh {
		switch notification := next.(type) {
		case updateNotification:
			p.handler.OnUpdate(notification.oldObj, notification.newObj)
		case addNotification:
			p.handler.OnAdd(notification.newObj)
		case deleteNotification:
			p.handler.OnDelete(notification.oldObj)
		default:
			utilruntime.HandleError(fmt.Errorf("unrecognized notification: %#v", next))
		}
	}
}

The concrete handler is assigned in NewDeploymentController (other controllers are similar). handler is an interface, defined as follows:

// ResourceEventHandler can handle notifications for events that happen to a
// resource. The events are informational only, so you can't return an
// error.
//  * OnAdd is called when an object is added.
//  * OnUpdate is called when an object is modified. Note that oldObj is the
//      last known state of the object-- it is possible that several changes
//      were combined together, so you can't use this to see every single
//      change. OnUpdate is also called when a re-list happens, and it will
//      get called even if nothing changed. This is useful for periodically
//      evaluating or syncing something.
//  * OnDelete will get the final state of the item if it is known, otherwise
//      it will get an object of type DeletedFinalStateUnknown. This can
//      happen if the watch is closed and misses the delete event and we don't
//      notice the deletion until the subsequent re-list.
type ResourceEventHandler interface {
	OnAdd(obj interface{})
	OnUpdate(oldObj, newObj interface{})
	OnDelete(obj interface{})
}
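A common way to satisfy an interface like this one is an adapter struct of optional function fields, so a caller can register only the callbacks it cares about. The sketch below uses local stand-ins (eventHandler, handlerFuncs) rather than the client-go types. The nil checks are the point of the example.

```go
package main

import "fmt"

// eventHandler mirrors the three-method shape of the interface above.
type eventHandler interface {
	OnAdd(obj interface{})
	OnUpdate(oldObj, newObj interface{})
	OnDelete(obj interface{})
}

// handlerFuncs is a hypothetical adapter: any field may be left nil.
type handlerFuncs struct {
	AddFunc    func(obj interface{})
	UpdateFunc func(oldObj, newObj interface{})
	DeleteFunc func(obj interface{})
}

// OnAdd calls AddFunc if it is set.
func (h handlerFuncs) OnAdd(obj interface{}) {
	if h.AddFunc != nil {
		h.AddFunc(obj)
	}
}

// OnUpdate calls UpdateFunc if it is set.
func (h handlerFuncs) OnUpdate(oldObj, newObj interface{}) {
	if h.UpdateFunc != nil {
		h.UpdateFunc(oldObj, newObj)
	}
}

// OnDelete calls DeleteFunc if it is set.
func (h handlerFuncs) OnDelete(obj interface{}) {
	if h.DeleteFunc != nil {
		h.DeleteFunc(obj)
	}
}

func main() {
	var h eventHandler = handlerFuncs{
		DeleteFunc: func(obj interface{}) { fmt.Println("deleted:", obj) },
	}
	h.OnAdd("default/web")    // no AddFunc registered: ignored
	h.OnDelete("default/web") // invokes DeleteFunc
}
```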

7.2. ResourceEventHandler

The following uses DeploymentController as an example.

NewDeploymentController registers the event handlers. Handlers are registered on three informers: dInformer, rsInformer, and podInformer.

// NewDeploymentController creates a new DeploymentController.
func NewDeploymentController(dInformer extensionsinformers.DeploymentInformer, rsInformer extensionsinformers.ReplicaSetInformer, podInformer coreinformers.PodInformer, client clientset.Interface) (*DeploymentController, error) {
	...
	dInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    dc.addDeployment,
		UpdateFunc: dc.updateDeployment,
		// This will enter the sync loop and no-op, because the deployment has been deleted from the store.
		DeleteFunc: dc.deleteDeployment,
	})
	rsInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc:    dc.addReplicaSet,
		UpdateFunc: dc.updateReplicaSet,
		DeleteFunc: dc.deleteReplicaSet,
	})
	podInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		DeleteFunc: dc.deletePod,
	})
	...
}

7.2.1. addDeployment

Taking addDeployment as an example: it hands the Deployment to enqueueDeployment, which adds it to the controller's work queue.

func (dc *DeploymentController) addDeployment(obj interface{}) {
	d := obj.(*extensions.Deployment)
	glog.V(4).Infof("Adding deployment %s", d.Name)
	dc.enqueueDeployment(d)
}

Definition of enqueueDeployment:

type DeploymentController struct {
	...
	enqueueDeployment func(deployment *extensions.Deployment)
	...
}

dc.enqueue is assigned to dc.enqueueDeployment:

dc.enqueueDeployment = dc.enqueue

dc.enqueue calls dc.queue.Add(key):

func (dc *DeploymentController) enqueue(deployment *extensions.Deployment) {
	key, err := controller.KeyFunc(deployment)
	if err != nil {
		utilruntime.HandleError(fmt.Errorf("Couldn't get key for object %#v: %v", deployment, err))
		return
	}

	dc.queue.Add(key)
}

dc.queue records the Deployments that need to be synced, for syncDeployment to consume.

dc := &DeploymentController{
	...
	queue:         workqueue.NewNamedRateLimitingQueue(workqueue.DefaultControllerRateLimiter(), "deployment"),
}

NewNamedRateLimitingQueue

func NewNamedRateLimitingQueue(rateLimiter RateLimiter, name string) RateLimitingInterface {
	return &rateLimitingType{
		DelayingInterface: NewNamedDelayingQueue(name),
		rateLimiter:       rateLimiter,
	}
}

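The analysis above shows that the processor keeps track of the event handlers of the different types. The handlers are registered in the NewXxxController constructors. A handler typically adds the object that needs processing to the work queue of its controller, and a sync function such as syncDeployment then carries out the logic that maintains the desired state.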
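The handler-to-queue-to-sync pattern can be sketched end to end. keyQueue and miniController below are hypothetical. The real controller uses a rate-limiting work queue, which this sketch replaces with a simple de-duplicating FIFO; the sync step only records the key.

```go
package main

import "fmt"

// keyQueue is a hypothetical de-duplicating FIFO of object keys.
type keyQueue struct {
	order   []string
	pending map[string]bool
}

func newKeyQueue() *keyQueue {
	return &keyQueue{pending: map[string]bool{}}
}

// Add enqueues key unless it is already waiting to be processed.
func (q *keyQueue) Add(key string) {
	if q.pending[key] {
		return
	}
	q.pending[key] = true
	q.order = append(q.order, key)
}

// Get removes and returns the oldest key.
func (q *keyQueue) Get() (string, bool) {
	if len(q.order) == 0 {
		return "", false
	}
	key := q.order[0]
	q.order = q.order[1:]
	delete(q.pending, key)
	return key, true
}

// miniController wires event callbacks to the queue and a sync loop.
type miniController struct {
	queue  *keyQueue
	synced []string
}

func newMiniController() *miniController {
	return &miniController{queue: newKeyQueue()}
}

// onEvent is what an add or update handler would do: enqueue the key only.
func (c *miniController) onEvent(namespace, name string) {
	c.queue.Add(namespace + "/" + name)
}

// runWorker drains the queue; a real controller would reconcile state here.
func (c *miniController) runWorker() {
	for {
		key, ok := c.queue.Get()
		if !ok {
			return
		}
		c.synced = append(c.synced, key)
	}
}

func main() {
	c := newMiniController()
	c.onEvent("default", "web")
	c.onEvent("default", "web") // a burst of events collapses into one sync
	c.onEvent("default", "db")
	c.runWorker()
	fmt.Println(c.synced)
}
```

Keeping handlers this thin means the expensive reconciliation runs once per key rather than once per event.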

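8. Summary

This article analyzed the Kubernetes informer mechanism, that is, the List-Watch mechanism.

8.1. Reflector

The Reflector watches a specified Kubernetes resource and syncs the changes into the local store. It only places resources of the expectedType into the store, unless expectedType is nil. If resyncPeriod is non-zero, the Reflector triggers a resync (store.Resync) every resyncPeriod, so a Reflector can be used to periodically reprocess all objects as well as to incrementally process the objects that change.

8.2. ListAndWatch

ListAndWatch first lists all objects and records their resource version, then watches from that resource version for changes. The initial list uses resource version 0, so the list result may lag behind what is stored in etcd. The Reflector catches up on the difference through the watch, so that the local cache ends up consistent with etcd.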

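8.3. DeltaFIFO

DeltaFIFO is a producer-consumer queue: the Reflector is the producer, and the consumer is whatever calls Pop().

DeltaFIFO is intended for the following use cases:

  • You want to process every object change (delta) at most once.
  • When you process an object, you want to see everything that has happened to it since you last processed it.
  • You want to process the deletion of objects.
  • You might want to periodically reprocess objects.

8.4. store

Store is a generic object storage interface. The Reflector watches the server and updates a store. Depending on the Store implementation, the Reflector can act as a local cache or work like a queue of items yet to be processed.

A Store implementation is responsible for keying objects correctly and for defining how objects are retrieved by key.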

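8.5. processor

The processor keeps track of all registered callbacks (the ResourceEventHandler instances) and is responsible for invoking them. sharedIndexInformer.Run calls processor.run.

Flow:

  1. The listener's add function hands the notification over through addCh so that it can be buffered in pendingNotifications.
  2. pop takes the first notification from pendingNotifications and sends it to the nextCh channel.
  3. run receives the notification and, depending on its type (add, delete, update), invokes the matching handler function. These handlers are registered in the various NewXxxController constructors.

The processor keeps track of the event handlers of the different types. The handlers are registered in the NewXxxController constructors. A handler typically adds the object that needs processing to the work queue of its controller, and a sync function such as syncDeployment then carries out the logic that maintains the desired state.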

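8.6. Main steps

  1. The Run function of controller-manager calls InformerFactory.Start, which initializes the informers of each type and starts one informer.Run goroutine per type.
  2. informer.Run first creates a DeltaFIFO queue to store object changes, then calls processor.run and controller.Run.
  3. controller.Run creates a Reflector. The Reflector watches a specified Kubernetes resource and syncs the changes into the local store. It triggers a resync every resyncPeriod, so a Reflector can be used to periodically reprocess all objects as well as to incrementally process the objects that change.
  4. The Reflector then runs ListAndWatch. ListAndWatch first lists all objects and records their resource version, then watches from that resource version for changes. The initial list uses resource version 0, so the list result may lag behind what is stored in etcd. The Reflector catches up on the difference through the watch, so that the local cache ends up consistent with etcd.
  5. controller.Run also calls processLoop. processLoop calls HandleDeltas, which calls distribute, which calls processorListener.add. The notifications for each kind of change end up in the processorListener's channel, where processorListener.run consumes them.
  6. The processor keeps track of all registered callbacks (the ResourceEventHandler instances) and is responsible for invoking them. The handlers are registered in the NewXxxController constructors. A handler typically adds the object that needs processing to the work queue of its controller, and a sync function such as syncDeployment then carries out the logic that maintains the desired state.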

11.4 -

11.4.1 -

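kube-scheduler Source Code Analysis (4): findNodesThatFit

The following analysis is based on Kubernetes v1.12.0.

This article analyzes the predicate phase of the scheduling logic, the first step, which filters out the nodes that satisfy the pod's scheduling requirements.

1. Call entry point

In the predicate phase, predicate functions decide whether each node is suitable for the pod.

genericScheduler.Schedule calls findNodesThatFit as follows:

This code is located in pkg/scheduler/core/generic_scheduler.go.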

func (g *genericScheduler) Schedule(pod *v1.Pod, nodeLister algorithm.NodeLister) (string, error) {
	...
	// List all nodes
	nodes, err := nodeLister.List()
	if err != nil {
		return "", err
	}
	if len(nodes) == 0 {
		return "", ErrNoNodesAvailable
	}

	// Used for all fit and priority funcs.
	err = g.cache.UpdateNodeNameToInfoMap(g.cachedNodeInfoMap)
	if err != nil {
		return "", err
	}

	trace.Step("Computing predicates")
	startPredicateEvalTime := time.Now()
	// Call findNodesThatFit to filter out the feasible nodes
	filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
	if err != nil {
		return "", err
	}

	if len(filteredNodes) == 0 {
		return "", &FitError{
			Pod:              pod,
			NumAllNodes:      len(nodes),
			FailedPredicates: failedPredicateMap,
		}
	}
	// metrics
	metrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))
	metrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))
	...
}

Core code:

// Call findNodesThatFit to filter out the feasible nodes
filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)

2. findNodesThatFit

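findNodesThatFit filters the nodes with the given predicate functions. Each node is passed through the predicate functions to determine whether it fits.

findNodesThatFit takes the pod to be scheduled and the current node list, and returns the list of feasible nodes, a map of failed predicates, and an error.

The basic flow of findNodesThatFit is:

  1. Compute the number of feasible nodes to find and use it as the size of the filtered-node slice, so that a large cluster does not require checking every node.
  2. Repeatedly get the next node from the NodeTree and check whether it satisfies the pod's scheduling requirements.
  3. Use the previously registered predicate functions to decide whether the current node satisfies the pod's scheduling requirements.
  4. Finally, return the list of nodes that satisfy the requirements, for use by the next step, the priority (scoring) phase.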
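The flow above can be sketched as a simplified, sequential filter. Everything here is hypothetical: node carries only two fields, predicates are plain functions, and the parallel checking and failed-predicate bookkeeping of the real implementation are omitted.

```go
package main

import "fmt"

// node is a hypothetical, minimal description of a cluster node.
type node struct {
	Name    string
	FreeCPU int
	Ready   bool
}

// predicate reports whether a node is acceptable for the pod.
type predicate func(n node) bool

func isReady(n node) bool { return n.Ready }

// hasCPU builds a predicate that requires at least min free CPUs.
func hasCPU(min int) predicate {
	return func(n node) bool { return n.FreeCPU >= min }
}

// findNodesThatFit returns up to limit nodes that pass every predicate.
func findNodesThatFit(nodes []node, predicates []predicate, limit int) []node {
	if len(predicates) == 0 {
		return nodes // nothing to check: every node is feasible
	}
	filtered := make([]node, 0, limit)
	for _, n := range nodes {
		if len(filtered) >= limit {
			break // enough feasible nodes found; skip the rest
		}
		fits := true
		for _, p := range predicates {
			if !p(n) {
				fits = false
				break
			}
		}
		if fits {
			filtered = append(filtered, n)
		}
	}
	return filtered
}

func main() {
	nodes := []node{
		{"a", 4, true}, {"b", 8, false}, {"c", 1, true}, {"d", 4, true}, {"e", 4, true},
	}
	fit := findNodesThatFit(nodes, []predicate{isReady, hasCPU(2)}, 2)
	for _, n := range fit {
		fmt.Println(n.Name)
	}
}
```

Stopping at the limit trades scheduling quality for latency: in a large cluster the scorer only ever sees a subset of the feasible nodes.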

The complete code of findNodesThatFit is:

This code is located in pkg/scheduler/core/generic_scheduler.go.

// Filters the nodes to find the ones that fit based on the given predicate functions
// Each node is passed through the predicate functions to determine if it is a fit
func (g *genericScheduler) findNodesThatFit(pod *v1.Pod, nodes []*v1.Node) ([]*v1.Node, FailedPredicateMap, error) {
   var filtered []*v1.Node
   failedPredicateMap := FailedPredicateMap{}

   if len(g.predicates) == 0 {
      filtered = nodes
   } else {
      allNodes := int32(g.cache.NodeTree().NumNodes)
      numNodesToFind := g.numFeasibleNodesToFind(allNodes)

      // Create filtered list with enough space to avoid growing it
      // and allow assigning.
      filtered = make([]*v1.Node, numNodesToFind)
      errs := errors.MessageCountMap{}
      var (
         predicateResultLock sync.Mutex
         filteredLen         int32
         equivClass          *equivalence.Class
      )

      ctx, cancel := context.WithCancel(context.Background())

      // We can use the same metadata producer for all nodes.
      meta := g.predicateMetaProducer(pod, g.cachedNodeInfoMap)

      if g.equivalenceCache != nil {
         // getEquivalenceClassInfo will return immediately if no equivalence pod found
         equivClass = equivalence.NewClass(pod)
      }

      checkNode := func(i int) {
         var nodeCache *equivalence.NodeCache
         nodeName := g.cache.NodeTree().Next()
         if g.equivalenceCache != nil {
            nodeCache, _ = g.equivalenceCache.GetNodeCache(nodeName)
         }
         fits, failedPredicates, err := podFitsOnNode(
            pod,
            meta,
            g.cachedNodeInfoMap[nodeName],
            g.predicates,
            g.cache,
            nodeCache,
            g.schedulingQueue,
            g.alwaysCheckAllPredicates,
            equivClass,
         )
         if err != nil {
            predicateResultLock.Lock()
            errs[err.Error()]++
            predicateResultLock.Unlock()
            return
         }
         if fits {
            length := atomic.AddInt32(&filteredLen, 1)
            if length > numNodesToFind {
               cancel()
               atomic.AddInt32(&filteredLen, -1)
            } else {
               filtered[length-1] = g.cachedNodeInfoMap[nodeName].Node()
            }
         } else {
            predicateResultLock.Lock()
            failedPredicateMap[nodeName] = failedPredicates
            predicateResultLock.Unlock()
         }
      }

      // Stops searching for more nodes once the configured number of feasible nodes
      // are found.
      workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode)

      filtered = filtered[:filteredLen]
      if len(errs) > 0 {
         return []*v1.Node{}, FailedPredicateMap{}, errors.CreateAggregateFromMessageCountMap(errs)
      }
   }

   if len(filtered) > 0 && len(g.extenders) != 0 {
      for _, extender := range g.extenders {
         if !extender.IsInterested(pod) {
            continue
         }
         filteredList, failedMap, err := extender.Filter(pod, filtered, g.cachedNodeInfoMap)
         if err != nil {
            if extender.IsIgnorable() {
               glog.Warningf("Skipping extender %v as it returned error %v and has ignorable flag set",
                  extender, err)
               continue
            } else {
               return []*v1.Node{}, FailedPredicateMap{}, err
            }
         }

         for failedNodeName, failedMsg := range failedMap {
            if _, found := failedPredicateMap[failedNodeName]; !found {
               failedPredicateMap[failedNodeName] = []algorithm.PredicateFailureReason{}
            }
            failedPredicateMap[failedNodeName] = append(failedPredicateMap[failedNodeName], predicates.NewFailureReason(failedMsg))
         }
         filtered = filteredList
         if len(filtered) == 0 {
            break
         }
      }
   }
   return filtered, failedPredicateMap, nil
}

The following sections analyze findNodesThatFit piece by piece.

3. numFeasibleNodesToFind

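findNodesThatFit first derives, from the total number of nodes, how many feasible nodes it needs to find. numFeasibleNodesToFind exists mainly to keep scheduling efficient when the cluster has many nodes (100 or more).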

allNodes := int32(g.cache.NodeTree().NumNodes)
numNodesToFind := g.numFeasibleNodesToFind(allNodes)

// Create filtered list with enough space to avoid growing it
// and allow assigning.
filtered = make([]*v1.Node, numNodesToFind)

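The basic flow of numFeasibleNodesToFind is:

  • If the total number of nodes is less than minFeasibleNodesToFind (currently 100 by default), or percentageOfNodesToScore is outside the range 1 to 99, return the total number of nodes.
  • Otherwise take percentageOfNodesToScore percent of the nodes; if that count is still below minFeasibleNodesToFind, return minFeasibleNodesToFind.
  • If that count is at least minFeasibleNodesToFind, return the count.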
// numFeasibleNodesToFind returns the number of feasible nodes that once found, the scheduler stops
// its search for more feasible nodes.
func (g *genericScheduler) numFeasibleNodesToFind(numAllNodes int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || g.percentageOfNodesToScore <= 0 ||
		g.percentageOfNodesToScore >= 100 {
		return numAllNodes
	}
	numNodes := numAllNodes * g.percentageOfNodesToScore / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}
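
The arithmetic above can be tried in isolation. The following is a minimal sketch, not the scheduler's own code: it copies the logic into a free function and passes percentageOfNodesToScore as a parameter (an assumption made here so that it runs without a genericScheduler).

```go
package main

import "fmt"

// minFeasibleNodesToFind mirrors the default of 100 mentioned in the text.
const minFeasibleNodesToFind int32 = 100

// numFeasibleNodesToFind is a standalone copy of the logic shown above.
// The percentage is a parameter here instead of a scheduler field.
func numFeasibleNodesToFind(numAllNodes, percentageOfNodesToScore int32) int32 {
	if numAllNodes < minFeasibleNodesToFind || percentageOfNodesToScore <= 0 ||
		percentageOfNodesToScore >= 100 {
		return numAllNodes
	}
	numNodes := numAllNodes * percentageOfNodesToScore / 100
	if numNodes < minFeasibleNodesToFind {
		return minFeasibleNodesToFind
	}
	return numNodes
}

func main() {
	fmt.Println(numFeasibleNodesToFind(50, 50))    // 50: fewer than 100 nodes, check them all
	fmt.Println(numFeasibleNodesToFind(150, 50))   // 100: 75 is raised to the minimum
	fmt.Println(numFeasibleNodesToFind(1000, 50))  // 500: 50% of 1000
	fmt.Println(numFeasibleNodesToFind(1000, 100)) // 1000: a percentage of 100 disables the limit
}
```

With a percentage of 50, a 150-node cluster still searches for 100 feasible nodes, because the computed 75 falls below the minimum.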

4. checkNode

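checkNode is the function that checks whether a single node is suitable; the core function it calls is podFitsOnNode. checkNode is then executed concurrently through workqueue.

The main flow of checkNode is:

  1. Fetch the next node from the NodeTree in the cache.
  2. Pass the node and the pod to podFitsOnNode to decide whether the node fits.
  3. If the node fits, add it to the filtered slice.
  4. If the node does not fit, record it in failedPredicateMap together with the failure reasons.
  5. workqueue.ParallelizeUntil runs checkNode concurrently (16 workers) and stops searching once the configured number of feasible nodes has been found.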
checkNode := func(i int) {
	var nodeCache *equivalence.NodeCache
	nodeName := g.cache.NodeTree().Next()
	if g.equivalenceCache != nil {
		nodeCache, _ = g.equivalenceCache.GetNodeCache(nodeName)
	}
	fits, failedPredicates, err := podFitsOnNode(
		pod,
		meta,
		g.cachedNodeInfoMap[nodeName],
		g.predicates,
		g.cache,
		nodeCache,
		g.schedulingQueue,
		g.alwaysCheckAllPredicates,
		equivClass,
	)
	if err != nil {
		predicateResultLock.Lock()
		errs[err.Error()]++
		predicateResultLock.Unlock()
		return
	}
	if fits {
		length := atomic.AddInt32(&filteredLen, 1)
		if length > numNodesToFind {
			cancel()
			atomic.AddInt32(&filteredLen, -1)
		} else {
			filtered[length-1] = g.cachedNodeInfoMap[nodeName].Node()
		}
	} else {
		predicateResultLock.Lock()
		failedPredicateMap[nodeName] = failedPredicates
		predicateResultLock.Unlock()
	}
}

The concurrent execution through workqueue:

// Stops searching for more nodes once the configured number of feasible nodes
// are found.
workqueue.ParallelizeUntil(ctx, 16, int(allNodes), checkNode)

The code of ParallelizeUntil is as follows:

// ParallelizeUntil is a framework that allows for parallelizing N
// independent pieces of work until done or the context is canceled.
func ParallelizeUntil(ctx context.Context, workers, pieces int, doWorkPiece DoWorkPieceFunc) {
	var stop <-chan struct{}
	if ctx != nil {
		stop = ctx.Done()
	}

	toProcess := make(chan int, pieces)
	for i := 0; i < pieces; i++ {
		toProcess <- i
	}
	close(toProcess)

	if pieces < workers {
		workers = pieces
	}

	wg := sync.WaitGroup{}
	wg.Add(workers)
	for i := 0; i < workers; i++ {
		go func() {
			defer utilruntime.HandleCrash()
			defer wg.Done()
			for piece := range toProcess {
				select {
				case <-stop:
					return
				default:
					doWorkPiece(piece)
				}
			}
		}()
	}
	wg.Wait()
}
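
The combination of an atomic counter, a cancelable context and a worker pool can be reproduced with the standard library alone. The sketch below is an illustration under assumptions: parallelizeUntil is a simplified stand-in for workqueue.ParallelizeUntil, and findFeasible and its integer items are hypothetical names invented for this example. Unlike checkNode, it indexes the input by piece number instead of calling NodeTree().Next().

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"sync/atomic"
)

// parallelizeUntil runs doWorkPiece for pieces 0..pieces-1 on a fixed number
// of workers and stops handing out work once ctx is canceled.
func parallelizeUntil(ctx context.Context, workers, pieces int, doWorkPiece func(int)) {
	toProcess := make(chan int, pieces)
	for i := 0; i < pieces; i++ {
		toProcess <- i
	}
	close(toProcess)

	if pieces < workers {
		workers = pieces
	}
	var wg sync.WaitGroup
	wg.Add(workers)
	for w := 0; w < workers; w++ {
		go func() {
			defer wg.Done()
			for piece := range toProcess {
				select {
				case <-ctx.Done():
					return
				default:
					doWorkPiece(piece)
				}
			}
		}()
	}
	wg.Wait()
}

// findFeasible returns at most limit items for which fits reports true.
func findFeasible(items []int, limit int32, fits func(int) bool) []int {
	filtered := make([]int, limit)
	var filteredLen int32
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	check := func(i int) {
		if !fits(items[i]) {
			return
		}
		length := atomic.AddInt32(&filteredLen, 1)
		if length > limit {
			cancel() // enough results: stop handing out more work
			atomic.AddInt32(&filteredLen, -1)
		} else {
			filtered[length-1] = items[i] // every slot is written by one goroutine only
		}
	}
	parallelizeUntil(ctx, 4, len(items), check)
	return filtered[:filteredLen]
}

func main() {
	items := make([]int, 1000)
	for i := range items {
		items[i] = i
	}
	even := func(n int) bool { return n%2 == 0 }
	got := findFeasible(items, 10, even)
	fmt.Println(len(got)) // 10 results; which even numbers they are depends on scheduling
}
```

Because the counter is incremented before the slot is chosen, no lock is needed around the slice: each index below the limit is handed out exactly once, and workers that overshoot undo their increment.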

5. podFitsOnNode

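The main points of podFitsOnNode are:

  • podFitsOnNode checks whether a given node satisfies the predicate functions.

  • For the given pod, podFitsOnNode checks whether an equivalent pod exists and reuses its cached predicate results where possible.

podFitsOnNode is called from two places: Schedule and Preempt.

When called from Schedule, it tests whether the pod can be scheduled onto the node, taking into account all pods already on the node plus the pods of equal or higher priority that are nominated to run on it.

When called from Preempt, the victims have already been chosen: SelectVictimsOnNode removes the pods to be evicted from meta and NodeInfo before calling podFitsOnNode, which then checks whether the pending pod fits on the node.

The basic flow of podFitsOnNode is:

  1. Iterate over the registered predicate keys in predicates.Ordering() and look up the function for each key.
  2. Run each predicate function, which returns whether the node fits, the failure reasons, and an error.
  3. If a predicate reports that the node does not fit, append its reasons to failedPredicates.
  4. Finally return whether failedPredicates is empty, together with the failure reasons.

Parameters: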

  • pod
  • PredicateMetadata
  • NodeInfo
  • predicateFuncs
  • schedulercache.Cache
  • nodeCache
  • SchedulingQueue
  • alwaysCheckAllPredicates
  • equivClass

Return values:

  • fit
  • PredicateFailureReason
  • error

Full code:

This code is located in pkg/scheduler/core/generic_scheduler.go

// podFitsOnNode checks whether a node given by NodeInfo satisfies the given predicate functions.
// For given pod, podFitsOnNode will check if any equivalent pod exists and try to reuse its cached
// predicate results as possible.
// This function is called from two different places: Schedule and Preempt.
// When it is called from Schedule, we want to test whether the pod is schedulable
// on the node with all the existing pods on the node plus higher and equal priority
// pods nominated to run on the node.
// When it is called from Preempt, we should remove the victims of preemption and
// add the nominated pods. Removal of the victims is done by SelectVictimsOnNode().
// It removes victims from meta and NodeInfo before calling this function.
func podFitsOnNode(
	pod *v1.Pod,
	meta algorithm.PredicateMetadata,
	info *schedulercache.NodeInfo,
	predicateFuncs map[string]algorithm.FitPredicate,
	cache schedulercache.Cache,
	nodeCache *equivalence.NodeCache,
	queue SchedulingQueue,
	alwaysCheckAllPredicates bool,
	equivClass *equivalence.Class,
) (bool, []algorithm.PredicateFailureReason, error) {
	var (
		eCacheAvailable  bool
		failedPredicates []algorithm.PredicateFailureReason
	)

	podsAdded := false
	// We run predicates twice in some cases. If the node has greater or equal priority
	// nominated pods, we run them when those pods are added to meta and nodeInfo.
	// If all predicates succeed in this pass, we run them again when these
	// nominated pods are not added. This second pass is necessary because some
	// predicates such as inter-pod affinity may not pass without the nominated pods.
	// If there are no nominated pods for the node or if the first run of the
	// predicates fail, we don't run the second pass.
	// We consider only equal or higher priority pods in the first pass, because
	// those are the current "pod" must yield to them and not take a space opened
	// for running them. It is ok if the current "pod" take resources freed for
	// lower priority pods.
	// Requiring that the new pod is schedulable in both circumstances ensures that
	// we are making a conservative decision: predicates like resources and inter-pod
	// anti-affinity are more likely to fail when the nominated pods are treated
	// as running, while predicates like pod affinity are more likely to fail when
	// the nominated pods are treated as not running. We can't just assume the
	// nominated pods are running because they are not running right now and in fact,
	// they may end up getting scheduled to a different node.
	for i := 0; i < 2; i++ {
		metaToUse := meta
		nodeInfoToUse := info
		if i == 0 {
			podsAdded, metaToUse, nodeInfoToUse = addNominatedPods(util.GetPodPriority(pod), meta, info, queue)
		} else if !podsAdded || len(failedPredicates) != 0 {
			break
		}
		// Bypass eCache if node has any nominated pods.
		// TODO(bsalamat): consider using eCache and adding proper eCache invalidations
		// when pods are nominated or their nominations change.
		eCacheAvailable = equivClass != nil && nodeCache != nil && !podsAdded
		for _, predicateKey := range predicates.Ordering() {
			var (
				fit     bool
				reasons []algorithm.PredicateFailureReason
				err     error
			)
			//TODO (yastij) : compute average predicate restrictiveness to export it as Prometheus metric
			if predicate, exist := predicateFuncs[predicateKey]; exist {
				if eCacheAvailable {
					fit, reasons, err = nodeCache.RunPredicate(predicate, predicateKey, pod, metaToUse, nodeInfoToUse, equivClass, cache)
				} else {
					fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
				}
				if err != nil {
					return false, []algorithm.PredicateFailureReason{}, err
				}

				if !fit {
					// eCache is available and valid, and predicates result is unfit, record the fail reasons
					failedPredicates = append(failedPredicates, reasons...)
					// if alwaysCheckAllPredicates is false, short circuit all predicates when one predicate fails.
					if !alwaysCheckAllPredicates {
						glog.V(5).Infoln("since alwaysCheckAllPredicates has not been set, the predicate " +
							"evaluation is short circuited and there are chances " +
							"of other predicates failing as well.")
						break
					}
				}
			}
		}
	}

	return len(failedPredicates) == 0, failedPredicates, nil
}
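
The long comment about running the predicates twice is easier to follow on a toy model. The sketch below is a simplification under assumptions: nodeState, predicate, maxPods and needsPod are hypothetical names, pods are plain strings, and only the boolean outcome is modeled (no failure reasons, no equivalence cache).

```go
package main

import "fmt"

// nodeState is a hypothetical, minimal view of a node: the pods counted as present.
type nodeState struct {
	pods []string
}

type predicate func(s nodeState) bool

// fitsConservatively mirrors the two-pass idea: the first pass treats the
// nominated pods as running; the second pass, run only when there are
// nominated pods and the first pass succeeded, treats them as absent.
func fitsConservatively(running, nominated []string, preds []predicate) bool {
	for pass := 0; pass < 2; pass++ {
		state := nodeState{pods: running}
		if pass == 0 {
			state.pods = append(append([]string{}, running...), nominated...)
		} else if len(nominated) == 0 {
			break
		}
		for _, p := range preds {
			if !p(state) {
				return false
			}
		}
	}
	return true
}

// maxPods fails when adding one more pod would exceed limit.
func maxPods(limit int) predicate {
	return func(s nodeState) bool { return len(s.pods)+1 <= limit }
}

// needsPod fails unless a pod with the given name is present (a stand-in for pod affinity).
func needsPod(name string) predicate {
	return func(s nodeState) bool {
		for _, p := range s.pods {
			if p == name {
				return true
			}
		}
		return false
	}
}

func main() {
	// Fails in pass 1: the nominated pod takes the last free slot.
	fmt.Println(fitsConservatively([]string{"a"}, []string{"b"}, []predicate{maxPods(2)}))
	// Fails in pass 2: the affinity target is only nominated, not running.
	fmt.Println(fitsConservatively([]string{"a"}, []string{"db"}, []predicate{needsPod("db")}))
	// Passes both ways.
	fmt.Println(fitsConservatively([]string{"db"}, []string{"x"}, []predicate{maxPods(3), needsPod("db")}))
}
```

The first case shows why resource-like predicates need the pass with nominated pods, and the second shows why affinity-like predicates need the pass without them.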

5.1. predicateFuncs

The previously registered predicate functions are run to decide whether the node is suitable for scheduling.

for _, predicateKey := range predicates.Ordering() {
	if predicate, exist := predicateFuncs[predicateKey]; exist {
		if eCacheAvailable {
			fit, reasons, err = nodeCache.RunPredicate(predicate, predicateKey, pod, metaToUse, nodeInfoToUse, equivClass, cache)
		} else {
			fit, reasons, err = predicate(pod, metaToUse, nodeInfoToUse)
		}

The predicates are evaluated in the following order:

var (
	predicatesOrdering = []string{CheckNodeConditionPred, CheckNodeUnschedulablePred,
		GeneralPred, HostNamePred, PodFitsHostPortsPred,
		MatchNodeSelectorPred, PodFitsResourcesPred, NoDiskConflictPred,
		PodToleratesNodeTaintsPred, PodToleratesNodeNoExecuteTaintsPred, CheckNodeLabelPresencePred,
		CheckServiceAffinityPred, MaxEBSVolumeCountPred, MaxGCEPDVolumeCountPred, MaxCSIVolumeCountPred,
		MaxAzureDiskVolumeCountPred, CheckVolumeBindingPred, NoVolumeZoneConflictPred,
		CheckNodeMemoryPressurePred, CheckNodePIDPressurePred, CheckNodeDiskPressurePred, MatchInterPodAffinityPred}
)
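
The interplay between the fixed ordering, the map of registered functions and alwaysCheckAllPredicates can be sketched as follows. This is an illustration, not the scheduler's code: fitPredicate is a hypothetical, reduced signature, and the predicate keys and reasons in main are made up for the example.

```go
package main

import "fmt"

// fitPredicate is a simplified, hypothetical predicate signature: it looks at
// a node name only and reports whether the node fits, plus a failure reason.
type fitPredicate func(node string) (bool, string)

// runPredicates evaluates the registered predicates in the given order.
// Keys in ordering without a registered function are skipped.
func runPredicates(ordering []string, funcs map[string]fitPredicate, node string, alwaysCheckAll bool) (bool, []string) {
	var failed []string
	for _, key := range ordering {
		predicate, exist := funcs[key]
		if !exist {
			continue
		}
		if fit, reason := predicate(node); !fit {
			failed = append(failed, reason)
			if !alwaysCheckAll {
				break // short-circuit on the first failure
			}
		}
	}
	return len(failed) == 0, failed
}

func main() {
	ordering := []string{"CheckNodeCondition", "PodFitsResources", "MatchInterPodAffinity"}
	funcs := map[string]fitPredicate{
		"CheckNodeCondition":    func(string) (bool, string) { return true, "" },
		"PodFitsResources":      func(string) (bool, string) { return false, "Insufficient cpu" },
		"MatchInterPodAffinity": func(string) (bool, string) { return false, "affinity mismatch" },
	}
	fmt.Println(runPredicates(ordering, funcs, "node-1", false)) // stops after the first failure
	fmt.Println(runPredicates(ordering, funcs, "node-1", true))  // collects every failure
}
```

Iterating over the ordering slice rather than over the map keeps evaluation deterministic, so short-circuiting always stops at the same predicate for the same node.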

6. PodFitsResources

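The following takes the predicate function PodFitsResources as an example; other important predicates will be analyzed separately later.

PodFitsResources checks whether a node has enough resources to run the pod, including CPU, memory, ephemeral storage, and scalar resources such as GPUs.

The basic flow of PodFitsResources is:

  1. If the number of pods on the node plus the pod being scheduled exceeds the node's allowed pod number, record a failure.
  2. If all of the pod's requests are 0, return immediately; the node fits unless step 1 recorded a failure.
  3. If the pod's request plus the sum of the requests of all pods on the node exceeds the node's allocatable resources, the pod may not be scheduled there.
  4. If the pod's request for a scalar (extended) resource plus the sum of the existing requests for that resource exceeds the node's allocatable amount, the pod may not be scheduled there.

PodFitsResources is registered as follows: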

factory.RegisterFitPredicate(predicates.PodFitsResourcesPred, predicates.PodFitsResources)

PodFitsResources parameters:

  • pod

  • nodeInfo

  • PredicateMetadata

PodFitsResources return values:

  • fit
  • PredicateFailureReason
  • error

Full code of PodFitsResources:

This code is located in pkg/scheduler/algorithm/predicates/predicates.go

// PodFitsResources checks if a node has sufficient resources, such as cpu, memory, gpu, opaque int resources etc to run a pod.
// First return value indicates whether a node has sufficient resources to run a pod while the second return value indicates the
// predicate failure reasons if the node has insufficient resources to run the pod.
func PodFitsResources(pod *v1.Pod, meta algorithm.PredicateMetadata, nodeInfo *schedulercache.NodeInfo) (bool, []algorithm.PredicateFailureReason, error) {
	node := nodeInfo.Node()
	if node == nil {
		return false, nil, fmt.Errorf("node not found")
	}

	var predicateFails []algorithm.PredicateFailureReason
	allowedPodNumber := nodeInfo.AllowedPodNumber()
	if len(nodeInfo.Pods())+1 > allowedPodNumber {
		predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourcePods, 1, int64(len(nodeInfo.Pods())), int64(allowedPodNumber)))
	}

	// No extended resources should be ignored by default.
	ignoredExtendedResources := sets.NewString()

	var podRequest *schedulercache.Resource
	if predicateMeta, ok := meta.(*predicateMetadata); ok {
		podRequest = predicateMeta.podRequest
		if predicateMeta.ignoredExtendedResources != nil {
			ignoredExtendedResources = predicateMeta.ignoredExtendedResources
		}
	} else {
		// We couldn't parse metadata - fallback to computing it.
		podRequest = GetResourceRequest(pod)
	}
	if podRequest.MilliCPU == 0 &&
		podRequest.Memory == 0 &&
		podRequest.EphemeralStorage == 0 &&
		len(podRequest.ScalarResources) == 0 {
		return len(predicateFails) == 0, predicateFails, nil
	}

	allocatable := nodeInfo.AllocatableResource()
	if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
		predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceCPU, podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
	}
	if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
		predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceMemory, podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
	}
	if allocatable.EphemeralStorage < podRequest.EphemeralStorage+nodeInfo.RequestedResource().EphemeralStorage {
		predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceEphemeralStorage, podRequest.EphemeralStorage, nodeInfo.RequestedResource().EphemeralStorage, allocatable.EphemeralStorage))
	}

	for rName, rQuant := range podRequest.ScalarResources {
		if v1helper.IsExtendedResourceName(rName) {
			// If this resource is one of the extended resources that should be
			// ignored, we will skip checking it.
			if ignoredExtendedResources.Has(string(rName)) {
				continue
			}
		}
		if allocatable.ScalarResources[rName] < rQuant+nodeInfo.RequestedResource().ScalarResources[rName] {
			predicateFails = append(predicateFails, NewInsufficientResourceError(rName, podRequest.ScalarResources[rName], nodeInfo.RequestedResource().ScalarResources[rName], allocatable.ScalarResources[rName]))
		}
	}

	if glog.V(10) {
		if len(predicateFails) == 0 {
			// We explicitly don't do glog.V(10).Infof() to avoid computing all the parameters if this is
			// not logged. There is visible performance gain from it.
			glog.Infof("Schedule Pod %+v on Node %+v is allowed, Node is running only %v out of %v Pods.",
				podName(pod), node.Name, len(nodeInfo.Pods()), allowedPodNumber)
		}
	}
	return len(predicateFails) == 0, predicateFails, nil
}

6.1. NodeInfo

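NodeInfo is the aggregated information about a node. It mainly includes:

  • node: the k8s Node object
  • pods: the pods currently on the node
  • requestedResource: the sum of the requests of all pods on the node
  • allocatableResource: all resources the node can actually allocate (Node.Status.Allocatable.*), which can be thought of as the node's total capacity for scheduling.

This code is located in pkg/scheduler/cache/node_info.go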

// NodeInfo is node level aggregated information.
type NodeInfo struct {
	// Overall node information.
	node *v1.Node

	pods             []*v1.Pod
	podsWithAffinity []*v1.Pod
	usedPorts        util.HostPortInfo

	// Total requested resource of all pods on this node.
	// It includes assumed pods which scheduler sends binding to apiserver but
	// didn't get it as scheduled yet.
	requestedResource *Resource
	nonzeroRequest    *Resource
	// We store allocatedResources (which is Node.Status.Allocatable.*) explicitly
	// as int64, to avoid conversions and accessing map.
	allocatableResource *Resource

	// Cached taints of the node for faster lookup.
	taints    []v1.Taint
	taintsErr error

	// imageStates holds the entry of an image if and only if this image is on the node. The entry can be used for
	// checking an image's existence and advanced usage (e.g., image locality scheduling policy) based on the image
	// state information.
	imageStates map[string]*ImageStateSummary

	// TransientInfo holds the information pertaining to a scheduling cycle. This will be destructed at the end of
	// scheduling cycle.
	// TODO: @ravig. Remove this once we have a clear approach for message passing across predicates and priorities.
	TransientInfo *transientSchedulerInfo

	// Cached conditions of node for faster lookup.
	memoryPressureCondition v1.ConditionStatus
	diskPressureCondition   v1.ConditionStatus
	pidPressureCondition    v1.ConditionStatus

	// Whenever NodeInfo changes, generation is bumped.
	// This is used to avoid cloning it if the object didn't change.
	generation int64
}

6.2. Resource

Resource is a collection of compute resources. It mainly includes:

  • MilliCPU
  • Memory
  • EphemeralStorage
  • AllowedPodNumber: the total number of pods allowed (Node.Status.Allocatable.Pods().Value()), typically 110.
  • ScalarResources
// Resource is a collection of compute resource.
type Resource struct {
	MilliCPU         int64
	Memory           int64
	EphemeralStorage int64
	// We store allowedPodNumber (which is Node.Status.Allocatable.Pods().Value())
	// explicitly as int, to avoid conversions and improve performance.
	AllowedPodNumber int
	// ScalarResources
	ScalarResources map[v1.ResourceName]int64
}

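The following walks through PodFitsResources step by step.

6.3. allowedPodNumber

First get the node information and check whether the number of pods currently on the node plus the pod being scheduled exceeds the number of pods the node allows (typically 110). If it does, a failure reason is appended to predicateFails, meaning the node is not suitable for the pod.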

node := nodeInfo.Node()
if node == nil {
	return false, nil, fmt.Errorf("node not found")
}

var predicateFails []algorithm.PredicateFailureReason
allowedPodNumber := nodeInfo.AllowedPodNumber()
if len(nodeInfo.Pods())+1 > allowedPodNumber {
	predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourcePods, 1, int64(len(nodeInfo.Pods())), int64(allowedPodNumber)))
}

6.4. podRequest

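If every field of podRequest is 0, the function returns immediately: the node fits unless the pod-count check above already recorded a failure.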

if podRequest.MilliCPU == 0 &&
	podRequest.Memory == 0 &&
	podRequest.EphemeralStorage == 0 &&
	len(podRequest.ScalarResources) == 0 {
	return len(predicateFails) == 0, predicateFails, nil
}

6.5. AllocatableResource

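If the request of the pod being scheduled plus the sum of the requests of all pods on the node exceeds the node's allocatable total, a failure reason is recorded and the pod will not be scheduled onto the node. The resources checked here are CPU, memory, and ephemeral storage.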

allocatable := nodeInfo.AllocatableResource()
if allocatable.MilliCPU < podRequest.MilliCPU+nodeInfo.RequestedResource().MilliCPU {
	predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceCPU, podRequest.MilliCPU, nodeInfo.RequestedResource().MilliCPU, allocatable.MilliCPU))
}
if allocatable.Memory < podRequest.Memory+nodeInfo.RequestedResource().Memory {
	predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceMemory, podRequest.Memory, nodeInfo.RequestedResource().Memory, allocatable.Memory))
}
if allocatable.EphemeralStorage < podRequest.EphemeralStorage+nodeInfo.RequestedResource().EphemeralStorage {
	predicateFails = append(predicateFails, NewInsufficientResourceError(v1.ResourceEphemeralStorage, podRequest.EphemeralStorage, nodeInfo.RequestedResource().EphemeralStorage, allocatable.EphemeralStorage))
}

6.6. ScalarResources

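Other scalar resources are checked the same way: if the pod's request plus the sum of the requests of all pods on the node for that resource exceeds the node's allocatable amount, the pod may not be scheduled onto the node. Extended resources listed in ignoredExtendedResources are skipped.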

for rName, rQuant := range podRequest.ScalarResources {
	if v1helper.IsExtendedResourceName(rName) {
		// If this resource is one of the extended resources that should be
		// ignored, we will skip checking it.
		if ignoredExtendedResources.Has(string(rName)) {
			continue
		}
	}
	if allocatable.ScalarResources[rName] < rQuant+nodeInfo.RequestedResource().ScalarResources[rName] {
		predicateFails = append(predicateFails, NewInsufficientResourceError(rName, podRequest.ScalarResources[rName], nodeInfo.RequestedResource().ScalarResources[rName], allocatable.ScalarResources[rName]))
	}
}
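
The comparison used throughout PodFitsResources, allocatable < request + already requested, can be tried on plain numbers. The sketch below is a reduced model under assumptions: the resource struct and fitsResources are hypothetical, ephemeral storage and the ignored-resource set are left out, and failures are reported as resource names rather than PredicateFailureReason values.

```go
package main

import "fmt"

// resource is a trimmed-down, hypothetical version of the Resource struct.
type resource struct {
	MilliCPU int64
	Memory   int64
	Scalar   map[string]int64
}

// fitsResources returns the names of the resources for which
// request + requested (already on the node) exceeds allocatable.
func fitsResources(request, requested, allocatable resource) []string {
	var insufficient []string
	if allocatable.MilliCPU < request.MilliCPU+requested.MilliCPU {
		insufficient = append(insufficient, "cpu")
	}
	if allocatable.Memory < request.Memory+requested.Memory {
		insufficient = append(insufficient, "memory")
	}
	for name, quant := range request.Scalar {
		// Reading a missing key from a map (even a nil one) yields 0.
		if allocatable.Scalar[name] < quant+requested.Scalar[name] {
			insufficient = append(insufficient, name)
		}
	}
	return insufficient
}

func main() {
	const gi = int64(1) << 30
	allocatable := resource{MilliCPU: 4000, Memory: 8 * gi, Scalar: map[string]int64{"example.com/gpu": 2}}
	requested := resource{MilliCPU: 3500, Memory: 4 * gi, Scalar: map[string]int64{"example.com/gpu": 2}}

	small := resource{MilliCPU: 500, Memory: 1 * gi}
	fmt.Println(fitsResources(small, requested, allocatable)) // [] : exactly fills the CPU

	big := resource{MilliCPU: 1000, Memory: 1 * gi, Scalar: map[string]int64{"example.com/gpu": 1}}
	fmt.Println(fitsResources(big, requested, allocatable)) // [cpu example.com/gpu]
}
```

Note that the comparison is strict: a pod whose request brings the total exactly to the allocatable amount still fits.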

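7. Summary

findNodesThatFit filters the nodes with the given predicate functions: each node is passed through the predicates to determine whether it fits.

Its inputs are the pod being scheduled and the current node list. It returns the list of nodes that passed, a map of per-node failure reasons, and an error.

The basic flow of findNodesThatFit is:

  1. Compute the number of feasible nodes to find and use it as the size of the filtered slice, so that a large cluster does not lead to an inefficient search over too many nodes.
  2. Repeatedly fetch the next node from NodeTree and check whether it satisfies the pod's scheduling constraints.
  3. Run the previously registered predicate functions against the current node.
  4. Return the list of nodes that satisfy the constraints, which feeds the next step, prioritizing (scoring).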

7.1. checkNode

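checkNode is the function that checks whether a single node is suitable; the core function it calls is podFitsOnNode. checkNode is then executed concurrently through workqueue.

The main flow of checkNode is:

  1. Fetch the next node from the NodeTree in the cache.
  2. Pass the node and the pod to podFitsOnNode to decide whether the node fits.
  3. If the node fits, add it to the filtered slice.
  4. If the node does not fit, record it in failedPredicateMap together with the failure reasons.
  5. workqueue.ParallelizeUntil runs checkNode concurrently and stops searching once the configured number of feasible nodes has been found.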

7.2. podFitsOnNode

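checkNode calls the core function podFitsOnNode.

The main points of podFitsOnNode are:

  • podFitsOnNode checks whether a given node satisfies the predicate functions.

  • For the given pod, podFitsOnNode checks whether an equivalent pod exists and reuses its cached predicate results where possible.

podFitsOnNode is called from two places: Schedule and Preempt.

When called from Schedule, it tests whether the pod can be scheduled onto the node, taking into account all pods already on the node plus the pods of equal or higher priority that are nominated to run on it.

When called from Preempt, SelectVictimsOnNode first removes the pods to be evicted from meta and NodeInfo; podFitsOnNode then checks whether the pending pod fits on the node.

The basic flow of podFitsOnNode is:

  1. Iterate over the registered predicate keys in predicates.Ordering() and look up the function for each key.
  2. Run each predicate function, which returns whether the node fits, the failure reasons, and an error.
  3. If a predicate reports that the node does not fit, append its reasons to failedPredicates.
  4. Finally return whether failedPredicates is empty, together with the failure reasons.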

7.3. PodFitsResources

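This article analyzed only one important predicate function as an example: PodFitsResources.

PodFitsResources checks whether a node has enough resources to run the pod, including CPU, memory, ephemeral storage, and scalar resources such as GPUs.

The basic flow of PodFitsResources is:

  1. If the number of pods on the node plus the pod being scheduled exceeds the node's allowed pod number, record a failure.
  2. If all of the pod's requests are 0, return immediately; the node fits unless step 1 recorded a failure.
  3. If the pod's request plus the sum of the requests of all pods on the node exceeds the node's allocatable resources, the pod may not be scheduled there.
  4. If the pod's request for a scalar (extended) resource plus the sum of the existing requests for that resource exceeds the node's allocatable amount, the pod may not be scheduled there.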

11.4.2 -

kube-scheduler source code analysis (1): NewSchedulerCommand

The following analysis is based on kubernetes v1.12.0.

The directory structure of the scheduler's cmd code is as follows:

kube-scheduler
├── BUILD
├── OWNERS
├── app            # mainly the objects needed to run the scheduler
│   ├── BUILD
│   ├── config
│   │   ├── BUILD
│   │   └── config.go    # the scheduler's configuration object, config
│   ├── options      # the parameters used by the scheduler
│   │   ├── BUILD
│   │   ├── configfile.go
│   │   ├── deprecated.go
│   │   ├── deprecated_test.go
│   │   ├── insecure_serving.go
│   │   ├── insecure_serving_test.go
│   │   ├── options.go    # mainly Options, NewOptions, AddFlags, Config, etc.
│   │   └── options_test.go
│   └── server.go    # mainly NewSchedulerCommand, NewSchedulerConfig, Run, etc.
└── scheduler.go     # main entry point

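1. The main function

This code is in /cmd/kube-scheduler/scheduler.go

The entry point of kube-scheduler, the main function, follows the same code style as the other components and uses the Cobra command-line framework.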

func main() {
	rand.Seed(time.Now().UTC().UnixNano())

	command := app.NewSchedulerCommand()

	// TODO: once we switch everything over to Cobra commands, we can go back to calling
	// utilflag.InitFlags() (by removing its pflag.Parse() call). For now, we have to set the
	// normalize func and add the go flag set by hand.
	pflag.CommandLine.SetNormalizeFunc(utilflag.WordSepNormalizeFunc)
	pflag.CommandLine.AddGoFlagSet(goflag.CommandLine)
	// utilflag.InitFlags()
	logs.InitLogs()
	defer logs.FlushLogs()

	if err := command.Execute(); err != nil {
		fmt.Fprintf(os.Stderr, "%v\n", err)
		os.Exit(1)
	}
}

Core code:

// Initialize the scheduler command struct.
command := app.NewSchedulerCommand()
// Execute the command.
err := command.Execute()

2. NewSchedulerCommand

This code is in /cmd/kube-scheduler/app/server.go

NewSchedulerCommand constructs and initializes the scheduler's cobra.Command struct.

// NewSchedulerCommand creates a *cobra.Command object with default parameters
func NewSchedulerCommand() *cobra.Command {
	opts, err := options.NewOptions()
	if err != nil {
		glog.Fatalf("unable to initialize command options: %v", err)
	}

	cmd := &cobra.Command{
		Use: "kube-scheduler",
		Long: `The Kubernetes scheduler is a policy-rich, topology-aware,
workload-specific function that significantly impacts availability, performance,
and capacity. The scheduler needs to take into account individual and collective
resource requirements, quality of service requirements, hardware/software/policy
constraints, affinity and anti-affinity specifications, data locality, inter-workload
interference, deadlines, and so on. Workload-specific requirements will be exposed
through the API as necessary.`,
		Run: func(cmd *cobra.Command, args []string) {
			verflag.PrintAndExitIfRequested()
			utilflag.PrintFlags(cmd.Flags())

			if len(args) != 0 {
				fmt.Fprint(os.Stderr, "arguments are not supported\n")
			}

			if errs := opts.Validate(); len(errs) > 0 {
				fmt.Fprintf(os.Stderr, "%v\n", utilerrors.NewAggregate(errs))
				os.Exit(1)
			}

			if len(opts.WriteConfigTo) > 0 {
				if err := options.WriteConfigFile(opts.WriteConfigTo, &opts.ComponentConfig); err != nil {
					fmt.Fprintf(os.Stderr, "%v\n", err)
					os.Exit(1)
				}
				glog.Infof("Wrote configuration to: %s\n", opts.WriteConfigTo)
				return
			}

			c, err := opts.Config()
			if err != nil {
				fmt.Fprintf(os.Stderr, "%v\n", err)
				os.Exit(1)
			}

			stopCh := make(chan struct{})
			if err := Run(c.Complete(), stopCh); err != nil {
				fmt.Fprintf(os.Stderr, "%v\n", err)
				os.Exit(1)
			}
		},
	}

	opts.AddFlags(cmd.Flags())
	cmd.MarkFlagFilename("config", "yaml", "yml", "json")

	return cmd
}

Core code:

// Construct the options.
opts, err := options.NewOptions()
// Initialize the config object.
c, err := opts.Config()
// Execute the Run function.
err := Run(c.Complete(), stopCh)
// Add the flags.
opts.AddFlags(cmd.Flags())

2.1. NewOptions

NewOptions constructs the parameters and context used by the scheduler server; the core parameter is KubeSchedulerConfiguration.

opts, err := options.NewOptions()

NewOptions:

// NewOptions returns default scheduler app options.
func NewOptions() (*Options, error) {
	cfg, err := newDefaultComponentConfig()
	if err != nil {
		return nil, err
	}

	hhost, hport, err := splitHostIntPort(cfg.HealthzBindAddress)
	if err != nil {
		return nil, err
	}

	o := &Options{
		ComponentConfig: *cfg,
		SecureServing:   nil, // TODO: enable with apiserveroptions.NewSecureServingOptions()
		CombinedInsecureServing: &CombinedInsecureServingOptions{
			Healthz: &apiserveroptions.DeprecatedInsecureServingOptions{
				BindNetwork: "tcp",
			},
			Metrics: &apiserveroptions.DeprecatedInsecureServingOptions{
				BindNetwork: "tcp",
			},
			BindPort:    hport,
			BindAddress: hhost,
		},
		Authentication: nil, // TODO: enable with apiserveroptions.NewDelegatingAuthenticationOptions()
		Authorization:  nil, // TODO: enable with apiserveroptions.NewDelegatingAuthorizationOptions()
		Deprecated: &DeprecatedOptions{
			UseLegacyPolicyConfig:    false,
			PolicyConfigMapNamespace: metav1.NamespaceSystem,
		},
	}

	return o, nil
}

2.2. Options.Config

Config initializes the scheduler's configuration object.

c, err := opts.Config()

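The Config function mainly performs the following steps:

  • Build the scheduler client, leaderElectionClient, and eventClient.
  • Create the event recorder.
  • Set up leader election.
  • Create the informer objects; the main functions are NewSharedInformerFactory and NewPodInformer.

The code of Config is as follows: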

// Config return a scheduler config object
func (o *Options) Config() (*schedulerappconfig.Config, error) {
	c := &schedulerappconfig.Config{}
	if err := o.ApplyTo(c); err != nil {
		return nil, err
	}

	// prepare kube clients.
	client, leaderElectionClient, eventClient, err := createClients(c.ComponentConfig.ClientConnection, o.Master, c.ComponentConfig.LeaderElection.RenewDeadline.Duration)
	if err != nil {
		return nil, err
	}

	// Prepare event clients.
	eventBroadcaster := record.NewBroadcaster()
	recorder := eventBroadcaster.NewRecorder(legacyscheme.Scheme, corev1.EventSource{Component: c.ComponentConfig.SchedulerName})

	// Set up leader election if enabled.
	var leaderElectionConfig *leaderelection.LeaderElectionConfig
	if c.ComponentConfig.LeaderElection.LeaderElect {
		leaderElectionConfig, err = makeLeaderElectionConfig(c.ComponentConfig.LeaderElection, leaderElectionClient, recorder)
		if err != nil {
			return nil, err
		}
	}

	c.Client = client
	c.InformerFactory = informers.NewSharedInformerFactory(client, 0)
	c.PodInformer = factory.NewPodInformer(client, 0)
	c.EventClient = eventClient
	c.Recorder = recorder
	c.Broadcaster = eventBroadcaster
	c.LeaderElection = leaderElectionConfig

	return c, nil
}

2.3. AddFlags

AddFlags registers the command-line flags of the scheduler server.

opts.AddFlags(cmd.Flags())

The code of the AddFlags function is as follows:

// AddFlags adds flags for the scheduler options.
func (o *Options) AddFlags(fs *pflag.FlagSet) {
	fs.StringVar(&o.ConfigFile, "config", o.ConfigFile, "The path to the configuration file. Flags override values in this file.")
	fs.StringVar(&o.WriteConfigTo, "write-config-to", o.WriteConfigTo, "If set, write the configuration values to this file and exit.")
	fs.StringVar(&o.Master, "master", o.Master, "The address of the Kubernetes API server (overrides any value in kubeconfig)")

	o.SecureServing.AddFlags(fs)
	o.CombinedInsecureServing.AddFlags(fs)
	o.Authentication.AddFlags(fs)
	o.Authorization.AddFlags(fs)
	o.Deprecated.AddFlags(fs, &o.ComponentConfig)

	leaderelectionconfig.BindFlags(&o.ComponentConfig.LeaderElection.LeaderElectionConfiguration, fs)
	utilfeature.DefaultFeatureGate.AddFlag(fs)
}
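
The pattern of binding flags to the fields of an options struct can be tried without the Kubernetes tree. The sketch below is only an analogy under stated assumptions: it uses the standard library flag package instead of pflag, and the options struct, addFlags and parseOptions are hypothetical names covering just the three string flags shown above.

```go
package main

import (
	"flag"
	"fmt"
)

// options is a hypothetical, reduced stand-in for the scheduler's Options.
type options struct {
	ConfigFile    string
	WriteConfigTo string
	Master        string
}

// addFlags binds each flag to a field; the field's current value is the default.
func (o *options) addFlags(fs *flag.FlagSet) {
	fs.StringVar(&o.ConfigFile, "config", o.ConfigFile, "The path to the configuration file.")
	fs.StringVar(&o.WriteConfigTo, "write-config-to", o.WriteConfigTo, "If set, write the configuration values to this file and exit.")
	fs.StringVar(&o.Master, "master", o.Master, "The address of the Kubernetes API server.")
}

// parseOptions builds default options, registers the flags and parses args.
func parseOptions(args []string) (*options, error) {
	o := &options{}
	fs := flag.NewFlagSet("kube-scheduler", flag.ContinueOnError)
	o.addFlags(fs)
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return o, nil
}

func main() {
	o, err := parseOptions([]string{"--config=/etc/scheduler.yaml", "--master", "https://127.0.0.1:6443"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", *o)
}
```

Because the flags write directly into the struct, the code that later builds the config only has to read fields and never touches the flag set again.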

3. Run

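This code is in /cmd/kube-scheduler/app/server.go

err := Run(c.Complete(), stopCh)

Run starts a long-running process that does not exit and performs the scheduler's work.

The main steps of Run are:

  • Create the scheduler struct from the scheduler config.
  • Start the event broadcaster, the healthz server, and the metrics server.
  • Start all informers and wait for their caches to sync before scheduling (important).
  • Call sched.Run() to run the scheduling logic.
  • If several schedulers are deployed and LeaderElect is enabled, run leader election.

The key pieces of code are analyzed separately below:

3.1. NewSchedulerConfig

NewSchedulerConfig initializes the SchedulerConfig (its logic will be analyzed separately later); the scheduler struct is then created from it.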

// Build a scheduler config from the provided algorithm source.
schedulerConfig, err := NewSchedulerConfig(c)
if err != nil {
	return err
}

// Create the scheduler.
sched := scheduler.NewFromConfig(schedulerConfig)

3.2. InformerFactory.Start

Start the PodInformer, then start the InformerFactory. This logic is the informer mechanism of client-go, which is analyzed in detail in the article on the Informer mechanism.

// Start all informers.
go c.PodInformer.Informer().Run(stopCh)
c.InformerFactory.Start(stopCh)

3.3. WaitForCacheSync

Wait for the caches to sync before scheduling.

// Wait for all caches to sync before scheduling.
c.InformerFactory.WaitForCacheSync(stopCh)
controller.WaitForCacheSync("scheduler", stopCh, c.PodInformer.Informer().HasSynced)

3.3.1. InformerFactory.WaitForCacheSync

InformerFactory.WaitForCacheSync waits until the caches of all started informers have synced, so that the local store is up to date and consistent with the data in etcd.

// WaitForCacheSync waits for all started informers' cache were synced.
func (f *sharedInformerFactory) WaitForCacheSync(stopCh <-chan struct{}) map[reflect.Type]bool {
	informers := func() map[reflect.Type]cache.SharedIndexInformer {
		f.lock.Lock()
		defer f.lock.Unlock()

		informers := map[reflect.Type]cache.SharedIndexInformer{}
		for informerType, informer := range f.informers {
			if f.startedInformers[informerType] {
				informers[informerType] = informer
			}
		}
		return informers
	}()

	res := map[reflect.Type]bool{}
	for informType, informer := range informers {
		res[informType] = cache.WaitForCacheSync(stopCh, informer.HasSynced)
	}
	return res
}

It then calls cache.WaitForCacheSync:

// WaitForCacheSync waits for caches to populate.  It returns true if it was successful, false
// if the controller should shutdown
func WaitForCacheSync(stopCh <-chan struct{}, cacheSyncs ...InformerSynced) bool {
	err := wait.PollUntil(syncedPollPeriod,
		func() (bool, error) {
			for _, syncFunc := range cacheSyncs {
				if !syncFunc() {
					return false, nil
				}
			}
			return true, nil
		},
		stopCh)
	if err != nil {
		glog.V(2).Infof("stop requested")
		return false
	}

	glog.V(4).Infof("caches populated")
	return true
}
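The polling idea behind cache.WaitForCacheSync can be sketched without client-go. This is a minimal sketch under stated assumptions: `waitForSync`, `syncedFunc`, and `pollPeriod` are hypothetical names, and a plain ticker stands in for wait.PollUntil.

```go
package main

import (
	"fmt"
	"time"
)

// syncedFunc reports whether one cache has finished its initial sync.
type syncedFunc func() bool

// pollPeriod is an assumed polling interval for this sketch.
const pollPeriod = time.Millisecond

// waitForSync polls until every func returns true (result: true)
// or until stopCh is closed (result: false).
func waitForSync(stopCh <-chan struct{}, funcs ...syncedFunc) bool {
	ticker := time.NewTicker(pollPeriod)
	defer ticker.Stop()
	for {
		allSynced := true
		for _, f := range funcs {
			if !f() {
				allSynced = false
				break
			}
		}
		if allSynced {
			return true
		}
		select {
		case <-stopCh:
			return false
		case <-ticker.C:
		}
	}
}

func main() {
	calls := 0
	// A fake cache that reports "synced" on the third poll.
	slow := func() bool { calls++; return calls >= 3 }
	fmt.Println(waitForSync(make(chan struct{}), slow))
}
```

The boolean result lets the caller tell "caches are ready" apart from "shutdown was requested", which is the same contract the functions above expose.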

3.3.2. controller.WaitForCacheSync

controller.WaitForCacheSync is a thin wrapper around cache.WaitForCacheSync that logs, by controller name, which controller is waiting for its caches to sync.

controller.WaitForCacheSync("scheduler", stopCh, c.PodInformer.Informer().HasSynced)

The implementation of controller.WaitForCacheSync is as follows:

// WaitForCacheSync is a wrapper around cache.WaitForCacheSync that generates log messages
// indicating that the controller identified by controllerName is waiting for syncs, followed by
// either a successful or failed sync.
func WaitForCacheSync(controllerName string, stopCh <-chan struct{}, cacheSyncs ...cache.InformerSynced) bool {
	glog.Infof("Waiting for caches to sync for %s controller", controllerName)

	if !cache.WaitForCacheSync(stopCh, cacheSyncs...) {
		utilruntime.HandleError(fmt.Errorf("Unable to sync caches for %s controller", controllerName))
		return false
	}

	glog.Infof("Caches are synced for %s controller", controllerName)
	return true
}

3.4. LeaderElection

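If there are multiple schedulers and leader election is enabled, the scheduler runs through the LeaderElector until the election ends or the process exits.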

// If leader election is enabled, run via LeaderElector until done and exit.
if c.LeaderElection != nil {
	c.LeaderElection.Callbacks = leaderelection.LeaderCallbacks{
		OnStartedLeading: run,
		OnStoppedLeading: func() {
			utilruntime.HandleError(fmt.Errorf("lost master"))
		},
	}
	leaderElector, err := leaderelection.NewLeaderElector(*c.LeaderElection)
	if err != nil {
		return fmt.Errorf("couldn't create leader elector: %v", err)
	}

	leaderElector.Run(ctx)

	return fmt.Errorf("lost lease")
}

3.5. Scheduler.Run

// Prepare a reusable run function.
run := func(ctx context.Context) {
	sched.Run()
	<-ctx.Done()
}

ctx, cancel := context.WithCancel(context.TODO()) // TODO once Run() accepts a context, it should be used here
defer cancel()

go func() {
	select {
	case <-stopCh:
		cancel()
	case <-ctx.Done():
	}
}()
...
run(ctx)

Scheduler.Run first waits for the caches to sync and then starts the goroutine that runs the scheduling logic.

The implementation of Scheduler.Run is as follows:

// Run begins watching and scheduling. It waits for cache to be synced, then starts a goroutine and returns immediately.
func (sched *Scheduler) Run() {
	if !sched.config.WaitForCacheSync() {
		return
	}

	go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything)
}

The above is an analysis of part of the code in /cmd/kube-scheduler/scheduler.go. The rest of Scheduler.Run lives in pkg/scheduler/scheduler.go and will be analyzed in a later article.
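The Run code in this section bridges a stop channel to a context so that either one can end the run function. A minimal sketch of that bridge, assuming a hypothetical helper named `contextFromStopCh` and using context.Background() as the parent:

```go
package main

import (
	"context"
	"fmt"
)

// contextFromStopCh returns a context that is cancelled when stopCh is closed
// or when the returned cancel function is called, whichever comes first.
func contextFromStopCh(stopCh <-chan struct{}) (context.Context, context.CancelFunc) {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		select {
		case <-stopCh:
			cancel()
		case <-ctx.Done():
			// Cancelled elsewhere: let the goroutine exit so it does not leak.
		}
	}()
	return ctx, cancel
}

func main() {
	stopCh := make(chan struct{})
	ctx, cancel := contextFromStopCh(stopCh)
	defer cancel()

	close(stopCh) // simulate a shutdown signal
	<-ctx.Done()  // unblocks once the goroutine observes the closed channel
	fmt.Println(ctx.Err())
}
```

The second select case matters: without it the helper goroutine would block forever if the context were cancelled before the stop channel closed.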

11.4.3 - kube-scheduler source code analysis (6): preempt

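The following code analysis is based on kubernetes v1.12.0.

This article analyzes the preemption logic in scheduling. When a pod does not fit on any node, scheduling may fail, and preemption may then take place. The preemption logic is implemented by Scheduler.preempt.

1. Call entry point

When a pod does not fit on any node, scheduling may fail, and preemption may then take place.

The part of scheduleOne that invokes preemption is shown below.

This code is in /pkg/scheduler/scheduler.go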

// scheduleOne does the entire scheduling workflow for a single pod.  It is serialized on the scheduling algorithm's host fitting.
func (sched *Scheduler) scheduleOne() {
	...
	suggestedHost, err := sched.schedule(pod)
	if err != nil {
		// schedule() may have failed because the pod would not fit on any host, so we try to
		// preempt, with the expectation that the next time the pod is tried for scheduling it
		// will fit due to the preemption. It is also possible that a different pod will schedule
		// into the resources that were preempted, but this is harmless.
		if fitError, ok := err.(*core.FitError); ok {
			preemptionStartTime := time.Now()
			// Run the preemption logic
			sched.preempt(pod, fitError)
			metrics.PreemptionAttempts.Inc()
			metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
			metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
		}
		return
	}
	...
}

The core code is:

// Run preemption for the pending pod, based on the err returned by sched.schedule(pod)
sched.preempt(pod, fitError)

2. Scheduler.preempt

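When a pod fails to schedule, the scheduler preempts the resources of lower-priority pods to make room for the higher-priority pod. The inputs are the pod that failed to schedule and the scheduling error.

The basic preemption flow is as follows:

  1. Check whether preemption is disabled; if so, return immediately.
  2. Fetch the latest object for the pod that failed to schedule.
  3. Run the preemption algorithm Algorithm.Preempt, which returns the nominated node and the list of pods to evict.
  4. Write the node returned by the algorithm to the pod's Status.NominatedNodeName and delete the pods to be evicted.
  5. When the algorithm returns a nil node, clear the pod's Status.NominatedNodeName.

The net effect on the preempting pod is an update of Pod.Status.NominatedNodeName. If the algorithm returns a non-nil node, that node is written to Pod.Status.NominatedNodeName; otherwise Pod.Status.NominatedNodeName is cleared.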
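The flow above can be sketched with in-memory stand-ins. This is not the scheduler's code: the `pod` and `preemptResult` types, the `Deleted` flag, and the `algo` callback are hypothetical, and API calls are replaced by field updates.

```go
package main

import "fmt"

// pod is a hypothetical, minimal stand-in for *v1.Pod.
type pod struct {
	Name          string
	NominatedNode string
	Deleted       bool
}

// preemptResult mirrors the three non-error values returned by the algorithm.
type preemptResult struct {
	Node    string // empty means no node was found
	Victims []*pod
	ToClear []*pod
}

// preempt nominates the node, deletes the victims, and clears stale nominations.
func preempt(disabled bool, preemptor *pod, algo func(*pod) preemptResult) string {
	if disabled {
		return ""
	}
	res := algo(preemptor)
	if res.Node != "" {
		preemptor.NominatedNode = res.Node
		for _, v := range res.Victims {
			v.Deleted = true
		}
	}
	// Clearing happens whether or not a node was found.
	for _, p := range res.ToClear {
		p.NominatedNode = ""
	}
	return res.Node
}

func main() {
	victim := &pod{Name: "low"}
	stale := &pod{Name: "stale", NominatedNode: "node-1"}
	high := &pod{Name: "high"}
	algo := func(*pod) preemptResult {
		return preemptResult{Node: "node-1", Victims: []*pod{victim}, ToClear: []*pod{stale}}
	}
	fmt.Println(preempt(false, high, algo), high.NominatedNode, victim.Deleted, stale.NominatedNode == "")
}
```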

2.1. preempt

The implementation of preempt:

This code is in /pkg/scheduler/scheduler.go

// preempt tries to create room for a pod that has failed to schedule, by preempting lower priority pods if possible.
// If it succeeds, it adds the name of the node where preemption has happened to the pod annotations.
// It returns the node name and an error if any.
func (sched *Scheduler) preempt(preemptor *v1.Pod, scheduleErr error) (string, error) {
	if !util.PodPriorityEnabled() || sched.config.DisablePreemption {
		glog.V(3).Infof("Pod priority feature is not enabled or preemption is disabled by scheduler configuration." +
			" No preemption is performed.")
		return "", nil
	}
	preemptor, err := sched.config.PodPreemptor.GetUpdatedPod(preemptor)
	if err != nil {
		glog.Errorf("Error getting the updated preemptor pod object: %v", err)
		return "", err
	}

	node, victims, nominatedPodsToClear, err := sched.config.Algorithm.Preempt(preemptor, sched.config.NodeLister, scheduleErr)
	metrics.PreemptionVictims.Set(float64(len(victims)))
	if err != nil {
		glog.Errorf("Error preempting victims to make room for %v/%v.", preemptor.Namespace, preemptor.Name)
		return "", err
	}
	var nodeName = ""
	if node != nil {
		nodeName = node.Name
		err = sched.config.PodPreemptor.SetNominatedNodeName(preemptor, nodeName)
		if err != nil {
			glog.Errorf("Error in preemption process. Cannot update pod %v/%v annotations: %v", preemptor.Namespace, preemptor.Name, err)
			return "", err
		}
		for _, victim := range victims {
			if err := sched.config.PodPreemptor.DeletePod(victim); err != nil {
				glog.Errorf("Error preempting pod %v/%v: %v", victim.Namespace, victim.Name, err)
				return "", err
			}
			sched.config.Recorder.Eventf(victim, v1.EventTypeNormal, "Preempted", "by %v/%v on node %v", preemptor.Namespace, preemptor.Name, nodeName)
		}
	}
	// Clearing nominated pods should happen outside of "if node != nil". Node could
	// be nil when a pod with nominated node name is eligible to preempt again,
	// but preemption logic does not find any node for it. In that case Preempt()
	// function of generic_scheduler.go returns the pod itself for removal of the annotation.
	for _, p := range nominatedPodsToClear {
		rErr := sched.config.PodPreemptor.RemoveNominatedNodeName(p)
		if rErr != nil {
			glog.Errorf("Cannot remove nominated node annotation of pod: %v", rErr)
			// We do not return as this error is not critical.
		}
	}
	return nodeName, err
}

The implementation of preempt is analyzed piece by piece below.

If preemption is disabled, return immediately.

if !util.PodPriorityEnabled() || sched.config.DisablePreemption {
	glog.V(3).Infof("Pod priority feature is not enabled or preemption is disabled by scheduler configuration." +
		" No preemption is performed.")
	return "", nil
}

Fetch the latest state of the current pod.

preemptor, err := sched.config.PodPreemptor.GetUpdatedPod(preemptor)
if err != nil {
	glog.Errorf("Error getting the updated preemptor pod object: %v", err)
	return "", err
}

GetUpdatedPod simply fetches the pod object through the API client.

func (p *podPreemptor) GetUpdatedPod(pod *v1.Pod) (*v1.Pod, error) {
	return p.Client.CoreV1().Pods(pod.Namespace).Get(pod.Name, metav1.GetOptions{})
}

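Next, run the preemption algorithm. It returns the nominated node and the pods to be evicted by preemption. The algorithm itself is analyzed further below.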

node, victims, nominatedPodsToClear, err := sched.config.Algorithm.Preempt(preemptor, sched.config.NodeLister, scheduleErr)

Write the nominated node to the pod's Status.NominatedNodeName field.

err = sched.config.PodPreemptor.SetNominatedNodeName(preemptor, nodeName)

SetNominatedNodeName is implemented as follows:

func (p *podPreemptor) SetNominatedNodeName(pod *v1.Pod, nominatedNodeName string) error {
	podCopy := pod.DeepCopy()
	podCopy.Status.NominatedNodeName = nominatedNodeName
	_, err := p.Client.CoreV1().Pods(pod.Namespace).UpdateStatus(podCopy)
	return err
}

Next, delete the pods that must be evicted by preemption.

err := sched.config.PodPreemptor.DeletePod(victim)

PodPreemptor.DeletePod simply deletes the given pod.

func (p *podPreemptor) DeletePod(pod *v1.Pod) error {
	return p.Client.CoreV1().Pods(pod.Namespace).Delete(pod.Name, &metav1.DeleteOptions{})
}

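Finally, clear the Status.NominatedNodeName of every pod in nominatedPodsToClear. This runs whether or not a node was found: when the algorithm finds no node for a pod that already has a nominated node, the list contains the preempting pod itself.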

// Clearing nominated pods should happen outside of "if node != nil". Node could
// be nil when a pod with nominated node name is eligible to preempt again,
// but preemption logic does not find any node for it. In that case Preempt()
// function of generic_scheduler.go returns the pod itself for removal of the annotation.
for _, p := range nominatedPodsToClear {
	rErr := sched.config.PodPreemptor.RemoveNominatedNodeName(p)
	if rErr != nil {
		glog.Errorf("Cannot remove nominated node annotation of pod: %v", rErr)
		// We do not return as this error is not critical.
	}
}

RemoveNominatedNodeName is implemented as follows:

func (p *podPreemptor) RemoveNominatedNodeName(pod *v1.Pod) error {
	if len(pod.Status.NominatedNodeName) == 0 {
		return nil
	}
	return p.SetNominatedNodeName(pod, "")
}

2.2. NominatedNodeName

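About Pod.Status.NominatedNodeName:

nominatedNodeName is set when a pod that failed to schedule preempts other pods; it names the node on which the preempted pods are running. The preempting pod is not scheduled until the victims have been removed, and it is not guaranteed to end up on the nominatedNodeName node: it may land on another node, for example if resources free up elsewhere in the meantime. The pod is eventually added back to the scheduling queue.

The path by which it is added to the scheduling queue is as follows: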

func NewConfigFactory(args *ConfigFactoryArgs) scheduler.Configurator {
  ...
  	// unscheduled pod queue
	args.PodInformer.Informer().AddEventHandler(
			...
			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    c.addPodToSchedulingQueue,
				UpdateFunc: c.updatePodInSchedulingQueue,
				DeleteFunc: c.deletePodFromSchedulingQueue,
			},
		},
	)
  ...
}  

addPodToSchedulingQueue:

func (c *configFactory) addPodToSchedulingQueue(obj interface{}) {
	if err := c.podQueue.Add(obj.(*v1.Pod)); err != nil {
		runtime.HandleError(fmt.Errorf("unable to queue %T: %v", obj, err))
	}
}

PriorityQueue.Add:

// Add adds a pod to the active queue. It should be called only when a new pod
// is added so there is no chance the pod is already in either queue.
func (p *PriorityQueue) Add(pod *v1.Pod) error {
	p.lock.Lock()
	defer p.lock.Unlock()
	err := p.activeQ.Add(pod)
	if err != nil {
		glog.Errorf("Error adding pod %v/%v to the scheduling queue: %v", pod.Namespace, pod.Name, err)
	} else {
		if p.unschedulableQ.get(pod) != nil {
			glog.Errorf("Error: pod %v/%v is already in the unschedulable queue.", pod.Namespace, pod.Name)
			p.deleteNominatedPodIfExists(pod)
			p.unschedulableQ.delete(pod)
		}
		p.addNominatedPodIfNeeded(pod)
		p.cond.Broadcast()
	}
	return err
}

addNominatedPodIfNeeded:

// addNominatedPodIfNeeded adds a pod to nominatedPods if it has a NominatedNodeName and it does not
// already exist in the map. Adding an existing pod is not going to update the pod.
func (p *PriorityQueue) addNominatedPodIfNeeded(pod *v1.Pod) {
	nnn := NominatedNodeName(pod)
	if len(nnn) > 0 {
		for _, np := range p.nominatedPods[nnn] {
			if np.UID == pod.UID {
				glog.Errorf("Pod %v/%v already exists in the nominated map!", pod.Namespace, pod.Name)
				return
			}
		}
		p.nominatedPods[nnn] = append(p.nominatedPods[nnn], pod)
	}
}

NominatedNodeName:

// NominatedNodeName returns nominated node name of a Pod.
func NominatedNodeName(pod *v1.Pod) string {
	return pod.Status.NominatedNodeName
}
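The bookkeeping done by addNominatedPodIfNeeded amounts to a map from node name to the pods nominated to that node, with a duplicate check by UID. A minimal sketch, assuming hypothetical `pod` and `nominatedIndex` types and a boolean return value added for illustration:

```go
package main

import "fmt"

// pod is a hypothetical stand-in carrying only the fields this sketch needs.
type pod struct {
	UID           string
	NominatedNode string
}

// nominatedIndex maps a node name to the pods nominated to run on it.
type nominatedIndex map[string][]*pod

// add records p under its nominated node. It returns false when p has no
// nomination or is already recorded.
func (idx nominatedIndex) add(p *pod) bool {
	if p.NominatedNode == "" {
		return false
	}
	for _, existing := range idx[p.NominatedNode] {
		if existing.UID == p.UID {
			return false
		}
	}
	idx[p.NominatedNode] = append(idx[p.NominatedNode], p)
	return true
}

func main() {
	idx := nominatedIndex{}
	p := &pod{UID: "a", NominatedNode: "node-1"}
	fmt.Println(idx.add(p), idx.add(p), idx.add(&pod{UID: "b"}), len(idx["node-1"]))
}
```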

3. genericScheduler.Preempt

The preemption algorithm is also defined in the ScheduleAlgorithm interface.

// ScheduleAlgorithm is an interface implemented by things that know how to schedule pods
// onto machines.
type ScheduleAlgorithm interface {
	Schedule(*v1.Pod, NodeLister) (selectedMachine string, err error)
	// Preempt receives scheduling errors for a pod and tries to create room for
	// the pod by preempting lower priority pods if possible.
	// It returns the node where preemption happened, a list of preempted pods, a
	// list of pods whose nominated node name should be removed, and error if any.
	Preempt(*v1.Pod, NodeLister, error) (selectedNode *v1.Node, preemptedPods []*v1.Pod, cleanupNominatedPods []*v1.Pod, err error)
	// Predicates() returns a pointer to a map of predicate functions. This is
	// exposed for testing.
	Predicates() map[string]FitPredicate
	// Prioritizers returns a slice of priority config. This is exposed for
	// testing.
	Prioritizers() []PriorityConfig
}

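Preempt is implemented by the genericScheduler struct.

Its main job is to find a node the pod can be scheduled on, together with the pods on that node that must be evicted by preemption.

The basic flow is as follows:

  1. Based on the reasons scheduling failed, do a first pass over all nodes and select the nodes where preemption might help.
  2. Use selectNodesForPreemption to select the candidate nodes and the victim pods on each.
  3. Filter those candidates again through the extenders' preemption logic.
  4. From the filtered result, pick the one node on which the pod may eventually be scheduled through preemption.
  5. For that candidate node, find the lower-priority pods already nominated to it, whose nomination must be cleared.

The full code is as follows:

This code is in pkg/scheduler/core/generic_scheduler.go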

// preempt finds nodes with pods that can be preempted to make room for "pod" to
// schedule. It chooses one of the nodes and preempts the pods on the node and
// returns 1) the node, 2) the list of preempted pods if such a node is found,
// 3) A list of pods whose nominated node name should be cleared, and 4) any
// possible error.
func (g *genericScheduler) Preempt(pod *v1.Pod, nodeLister algorithm.NodeLister, scheduleErr error) (*v1.Node, []*v1.Pod, []*v1.Pod, error) {
	// Scheduler may return various types of errors. Consider preemption only if
	// the error is of type FitError.
	fitError, ok := scheduleErr.(*FitError)
	if !ok || fitError == nil {
		return nil, nil, nil, nil
	}
	err := g.cache.UpdateNodeNameToInfoMap(g.cachedNodeInfoMap)
	if err != nil {
		return nil, nil, nil, err
	}
	if !podEligibleToPreemptOthers(pod, g.cachedNodeInfoMap) {
		glog.V(5).Infof("Pod %v/%v is not eligible for more preemption.", pod.Namespace, pod.Name)
		return nil, nil, nil, nil
	}
	allNodes, err := nodeLister.List()
	if err != nil {
		return nil, nil, nil, err
	}
	if len(allNodes) == 0 {
		return nil, nil, nil, ErrNoNodesAvailable
	}
	potentialNodes := nodesWherePreemptionMightHelp(allNodes, fitError.FailedPredicates)
	if len(potentialNodes) == 0 {
		glog.V(3).Infof("Preemption will not help schedule pod %v/%v on any node.", pod.Namespace, pod.Name)
		// In this case, we should clean-up any existing nominated node name of the pod.
		return nil, nil, []*v1.Pod{pod}, nil
	}
	pdbs, err := g.cache.ListPDBs(labels.Everything())
	if err != nil {
		return nil, nil, nil, err
	}
	// Find the nodes on which preemption is possible
	nodeToVictims, err := selectNodesForPreemption(pod, g.cachedNodeInfoMap, potentialNodes, g.predicates,
		g.predicateMetaProducer, g.schedulingQueue, pdbs)
	if err != nil {
		return nil, nil, nil, err
	}

	// We will only check nodeToVictims with extenders that support preemption.
	// Extenders which do not support preemption may later prevent preemptor from being scheduled on the nominated
	// node. In that case, scheduler will find a different host for the preemptor in subsequent scheduling cycles.
	nodeToVictims, err = g.processPreemptionWithExtenders(pod, nodeToVictims)
	if err != nil {
		return nil, nil, nil, err
	}
	// Pick the node that is finally preempted
	candidateNode := pickOneNodeForPreemption(nodeToVictims)
	if candidateNode == nil {
		return nil, nil, nil, err
	}

	// Lower priority pods nominated to run on this node, may no longer fit on
	// this node. So, we should remove their nomination. Removing their
	// nomination updates these pods and moves them to the active queue. It
	// lets scheduler find another place for them.
	// Find the lower-priority pods nominated to the chosen node
	nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
	if nodeInfo, ok := g.cachedNodeInfoMap[candidateNode.Name]; ok {
		return nodeInfo.Node(), nodeToVictims[candidateNode].Pods, nominatedPods, err
	}

	return nil, nil, nil, fmt.Errorf(
		"preemption failed: the target node %s has been deleted from scheduler cache",
		candidateNode.Name)
}

genericScheduler.Preempt is analyzed piece by piece below.

3.1. selectNodesForPreemption

selectNodesForPreemption searches the potential nodes in parallel for nodes on which preemption is possible.

nodeToVictims, err := selectNodesForPreemption(pod, g.cachedNodeInfoMap, potentialNodes, g.predicates,g.predicateMetaProducer, g.schedulingQueue, pdbs)

selectNodesForPreemption builds a checkNode function on top of selectVictimsOnNode and then runs it concurrently.

selectNodesForPreemption is implemented as follows:

// selectNodesForPreemption finds all the nodes with possible victims for
// preemption in parallel.
func selectNodesForPreemption(pod *v1.Pod,
	nodeNameToInfo map[string]*schedulercache.NodeInfo,
	potentialNodes []*v1.Node,
	predicates map[string]algorithm.FitPredicate,
	metadataProducer algorithm.PredicateMetadataProducer,
	queue SchedulingQueue,
	pdbs []*policy.PodDisruptionBudget,
) (map[*v1.Node]*schedulerapi.Victims, error) {

	nodeToVictims := map[*v1.Node]*schedulerapi.Victims{}
	var resultLock sync.Mutex

	// We can use the same metadata producer for all nodes.
	meta := metadataProducer(pod, nodeNameToInfo)
	checkNode := func(i int) {
		nodeName := potentialNodes[i].Name
		var metaCopy algorithm.PredicateMetadata
		if meta != nil {
			metaCopy = meta.ShallowCopy()
		}
		pods, numPDBViolations, fits := selectVictimsOnNode(pod, metaCopy, nodeNameToInfo[nodeName], predicates, queue, pdbs)
		if fits {
			resultLock.Lock()
			victims := schedulerapi.Victims{
				Pods:             pods,
				NumPDBViolations: numPDBViolations,
			}
			nodeToVictims[potentialNodes[i]] = &victims
			resultLock.Unlock()
		}
	}
	workqueue.Parallelize(16, len(potentialNodes), checkNode)
	return nodeToVictims, nil
}
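The concurrency pattern used here, a bounded number of workers that each check one index and write into a mutex-guarded map, can be sketched with the standard library alone. `parallelFilter` is a hypothetical helper that stands in for workqueue.Parallelize plus the result lock; the CPU figures are made up.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelFilter runs check on every index in [0, n) using `workers`
// goroutines and collects the indices for which check returns true.
func parallelFilter(workers, n int, check func(i int) bool) map[int]bool {
	result := map[int]bool{}
	var mu sync.Mutex
	var wg sync.WaitGroup
	indices := make(chan int)
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for i := range indices {
				if check(i) {
					mu.Lock() // the map is shared, so writes are serialized
					result[i] = true
					mu.Unlock()
				}
			}
		}()
	}
	for i := 0; i < n; i++ {
		indices <- i
	}
	close(indices)
	wg.Wait()
	return result
}

func main() {
	freeCPU := []int{1, 4, 2, 8}
	fits := parallelFilter(2, len(freeCPU), func(i int) bool { return freeCPU[i] >= 4 })
	fmt.Println(len(fits), fits[1], fits[3])
}
```

Only the write to the shared map is locked; the potentially expensive check runs outside the lock, as in the original checkNode.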

3.1.1. selectVictimsOnNode

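selectVictimsOnNode finds the minimum set of pods on the given node that must be preempted to make enough room for the pod that failed to schedule. It returns that set as a slice of pods. A higher-priority pod is never put into the set when a lower-priority pod could be chosen instead.

The basic flow is as follows:

  1. Check whether the pod could be scheduled on this node once all pods with a lower priority than it are removed.
  2. If so, sort the node's lower-priority pods by priority.
  3. Split them into pods whose PodDisruptionBudget would be violated by preemption and the rest, then try to reprieve as many pods as possible, PDB-violating ones first, while checking that the pod still fits.

The code is as follows:
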
// selectVictimsOnNode finds minimum set of pods on the given node that should
// be preempted in order to make enough room for "pod" to be scheduled. The
// minimum set selected is subject to the constraint that a higher-priority pod
// is never preempted when a lower-priority pod could be (higher/lower relative
// to one another, not relative to the preemptor "pod").
// The algorithm first checks if the pod can be scheduled on the node when all the
// lower priority pods are gone. If so, it sorts all the lower priority pods by
// their priority and then puts them into two groups of those whose PodDisruptionBudget
// will be violated if preempted and other non-violating pods. Both groups are
// sorted by priority. It first tries to reprieve as many PDB violating pods as
// possible and then does them same for non-PDB-violating pods while checking
// that the "pod" can still fit on the node.
// NOTE: This function assumes that it is never called if "pod" cannot be scheduled
// due to pod affinity, node affinity, or node anti-affinity reasons. None of
// these predicates can be satisfied by removing more pods from the node.
func selectVictimsOnNode(
	pod *v1.Pod,
	meta algorithm.PredicateMetadata,
	nodeInfo *schedulercache.NodeInfo,
	fitPredicates map[string]algorithm.FitPredicate,
	queue SchedulingQueue,
	pdbs []*policy.PodDisruptionBudget,
) ([]*v1.Pod, int, bool) {
	potentialVictims := util.SortableList{CompFunc: util.HigherPriorityPod}
	nodeInfoCopy := nodeInfo.Clone()

	removePod := func(rp *v1.Pod) {
		nodeInfoCopy.RemovePod(rp)
		if meta != nil {
			meta.RemovePod(rp)
		}
	}
	addPod := func(ap *v1.Pod) {
		nodeInfoCopy.AddPod(ap)
		if meta != nil {
			meta.AddPod(ap, nodeInfoCopy)
		}
	}
	// As the first step, remove all the lower priority pods from the node and
	// check if the given pod can be scheduled.
	podPriority := util.GetPodPriority(pod)
	for _, p := range nodeInfoCopy.Pods() {
		if util.GetPodPriority(p) < podPriority {
			potentialVictims.Items = append(potentialVictims.Items, p)
			removePod(p)
		}
	}
	potentialVictims.Sort()
	// If the new pod does not fit after removing all the lower priority pods,
	// we are almost done and this node is not suitable for preemption. The only condition
	// that we should check is if the "pod" is failing to schedule due to pod affinity
	// failure.
	// TODO(bsalamat): Consider checking affinity to lower priority pods if feasible with reasonable performance.
	if fits, _, err := podFitsOnNode(pod, meta, nodeInfoCopy, fitPredicates, nil, nil, queue, false, nil); !fits {
		if err != nil {
			glog.Warningf("Encountered error while selecting victims on node %v: %v", nodeInfo.Node().Name, err)
		}
		return nil, 0, false
	}
	var victims []*v1.Pod
	numViolatingVictim := 0
	// Try to reprieve as many pods as possible. We first try to reprieve the PDB
	// violating victims and then other non-violating ones. In both cases, we start
	// from the highest priority victims.
	violatingVictims, nonViolatingVictims := filterPodsWithPDBViolation(potentialVictims.Items, pdbs)
	reprievePod := func(p *v1.Pod) bool {
		addPod(p)
		fits, _, _ := podFitsOnNode(pod, meta, nodeInfoCopy, fitPredicates, nil, nil, queue, false, nil)
		if !fits {
			removePod(p)
			victims = append(victims, p)
			glog.V(5).Infof("Pod %v is a potential preemption victim on node %v.", p.Name, nodeInfo.Node().Name)
		}
		return fits
	}
	for _, p := range violatingVictims {
		if !reprievePod(p) {
			numViolatingVictim++
		}
	}
	// Now we try to reprieve non-violating victims.
	for _, p := range nonViolatingVictims {
		reprievePod(p)
	}
	return victims, numViolatingVictim, true
}
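The "remove everything, then reprieve" idea can be sketched with a single resource dimension. This is a simplified illustration, not the scheduler's logic: the `pod` type and `selectVictims` are hypothetical, fitting is reduced to a CPU sum, and the PDB grouping is left out.

```go
package main

import (
	"fmt"
	"sort"
)

// pod is a hypothetical stand-in: a priority and a single CPU request.
type pod struct {
	Name     string
	Priority int
	CPU      int
}

// selectVictims returns the pods that must be evicted so that a preemptor with
// the given priority and CPU request fits on a node with `capacity` CPUs.
// The second result is false when the node cannot fit the preemptor at all.
func selectVictims(capacity int, running []pod, priority, cpu int) ([]pod, bool) {
	used := 0
	var candidates []pod
	for _, p := range running {
		if p.Priority < priority {
			candidates = append(candidates, p) // tentatively removed from the node
		} else {
			used += p.CPU
		}
	}
	if used+cpu > capacity {
		return nil, false
	}
	// Try to reprieve the highest-priority candidates first.
	sort.SliceStable(candidates, func(i, j int) bool {
		return candidates[i].Priority > candidates[j].Priority
	})
	var victims []pod
	for _, p := range candidates {
		if used+p.CPU+cpu <= capacity {
			used += p.CPU // reprieved: the preemptor still fits with p added back
		} else {
			victims = append(victims, p)
		}
	}
	return victims, true
}

func main() {
	running := []pod{{"a", 10, 2}, {"b", 5, 2}, {"c", 1, 2}}
	victims, ok := selectVictims(6, running, 20, 3)
	fmt.Println(ok, victims)
}
```

In the usage above, pod "a" is reprieved because the preemptor still fits next to it, while "b" and "c" become victims.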

3.2. processPreemptionWithExtenders

processPreemptionWithExtenders takes the victims selected by selectNodesForPreemption and filters them further through the preemption logic of the extenders.

// We will only check nodeToVictims with extenders that support preemption.
// Extenders which do not support preemption may later prevent preemptor from being scheduled on the nominated
// node. In that case, scheduler will find a different host for the preemptor in subsequent scheduling cycles.
nodeToVictims, err = g.processPreemptionWithExtenders(pod, nodeToVictims)
if err != nil {
	return nil, nil, nil, err
}

The full code of processPreemptionWithExtenders is as follows:

// processPreemptionWithExtenders processes preemption with extenders
func (g *genericScheduler) processPreemptionWithExtenders(
	pod *v1.Pod,
	nodeToVictims map[*v1.Node]*schedulerapi.Victims,
) (map[*v1.Node]*schedulerapi.Victims, error) {
	if len(nodeToVictims) > 0 {
		for _, extender := range g.extenders {
			if extender.SupportsPreemption() && extender.IsInterested(pod) {
				newNodeToVictims, err := extender.ProcessPreemption(
					pod,
					nodeToVictims,
					g.cachedNodeInfoMap,
				)
				if err != nil {
					if extender.IsIgnorable() {
						glog.Warningf("Skipping extender %v as it returned error %v and has ignorable flag set",
							extender, err)
						continue
					}
					return nil, err
				}

				// Replace nodeToVictims with new result after preemption. So the
				// rest of extenders can continue use it as parameter.
				nodeToVictims = newNodeToVictims

				// If node list becomes empty, no preemption can happen regardless of other extenders.
				if len(nodeToVictims) == 0 {
					break
				}
			}
		}
	}

	return nodeToVictims, nil
}
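The chaining behaviour, where each extender receives the previous extender's output and an ignorable extender is skipped on error, can be sketched as follows. The `extender` type, its fields, and `errUnavailable` are hypothetical; candidate nodes are reduced to plain names.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable is a made-up error used by the example extender.
var errUnavailable = errors.New("extender unavailable")

// extender is a hypothetical stand-in for a scheduler extender that can
// narrow down the candidate nodes.
type extender struct {
	Ignorable bool
	Process   func(nodes []string) ([]string, error)
}

// processWithExtenders feeds the candidates through each extender in turn.
func processWithExtenders(nodes []string, extenders []extender) ([]string, error) {
	for _, e := range extenders {
		if len(nodes) == 0 {
			break // nothing left to filter
		}
		filtered, err := e.Process(nodes)
		if err != nil {
			if e.Ignorable {
				continue // skip a failing extender that is marked ignorable
			}
			return nil, err
		}
		nodes = filtered
	}
	return nodes, nil
}

func main() {
	failing := extender{Ignorable: true, Process: func([]string) ([]string, error) {
		return nil, errUnavailable
	}}
	dropFirst := extender{Process: func(n []string) ([]string, error) { return n[1:], nil }}
	out, err := processWithExtenders([]string{"n1", "n2", "n3"}, []extender{failing, dropFirst})
	fmt.Println(out, err)
}
```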

3.3. pickOneNodeForPreemption

pickOneNodeForPreemption picks one node from the filtered candidates as the final node to preempt on.

candidateNode := pickOneNodeForPreemption(nodeToVictims)
if candidateNode == nil {
	return nil, nil, nil, err
}

The full code of pickOneNodeForPreemption is as follows:

// pickOneNodeForPreemption chooses one node among the given nodes. It assumes
// pods in each map entry are ordered by decreasing priority.
// It picks a node based on the following criteria:
// 1. A node with minimum number of PDB violations.
// 2. A node with minimum highest priority victim is picked.
// 3. Ties are broken by sum of priorities of all victims.
// 4. If there are still ties, node with the minimum number of victims is picked.
// 5. If there are still ties, the first such node is picked (sort of randomly).
// The 'minNodes1' and 'minNodes2' are being reused here to save the memory
// allocation and garbage collection time.
func pickOneNodeForPreemption(nodesToVictims map[*v1.Node]*schedulerapi.Victims) *v1.Node {
	if len(nodesToVictims) == 0 {
		return nil
	}
	minNumPDBViolatingPods := math.MaxInt32
	var minNodes1 []*v1.Node
	lenNodes1 := 0
	for node, victims := range nodesToVictims {
		if len(victims.Pods) == 0 {
			// We found a node that doesn't need any preemption. Return it!
			// This should happen rarely when one or more pods are terminated between
			// the time that scheduler tries to schedule the pod and the time that
			// preemption logic tries to find nodes for preemption.
			return node
		}
		numPDBViolatingPods := victims.NumPDBViolations
		if numPDBViolatingPods < minNumPDBViolatingPods {
			minNumPDBViolatingPods = numPDBViolatingPods
			minNodes1 = nil
			lenNodes1 = 0
		}
		if numPDBViolatingPods == minNumPDBViolatingPods {
			minNodes1 = append(minNodes1, node)
			lenNodes1++
		}
	}
	if lenNodes1 == 1 {
		return minNodes1[0]
	}

	// There are more than one node with minimum number PDB violating pods. Find
	// the one with minimum highest priority victim.
	minHighestPriority := int32(math.MaxInt32)
	var minNodes2 = make([]*v1.Node, lenNodes1)
	lenNodes2 := 0
	for i := 0; i < lenNodes1; i++ {
		node := minNodes1[i]
		victims := nodesToVictims[node]
		// highestPodPriority is the highest priority among the victims on this node.
		highestPodPriority := util.GetPodPriority(victims.Pods[0])
		if highestPodPriority < minHighestPriority {
			minHighestPriority = highestPodPriority
			lenNodes2 = 0
		}
		if highestPodPriority == minHighestPriority {
			minNodes2[lenNodes2] = node
			lenNodes2++
		}
	}
	if lenNodes2 == 1 {
		return minNodes2[0]
	}

	// There are a few nodes with minimum highest priority victim. Find the
	// smallest sum of priorities.
	minSumPriorities := int64(math.MaxInt64)
	lenNodes1 = 0
	for i := 0; i < lenNodes2; i++ {
		var sumPriorities int64
		node := minNodes2[i]
		for _, pod := range nodesToVictims[node].Pods {
			// We add MaxInt32+1 to all priorities to make all of them >= 0. This is
			// needed so that a node with a few pods with negative priority is not
			// picked over a node with a smaller number of pods with the same negative
			// priority (and similar scenarios).
			sumPriorities += int64(util.GetPodPriority(pod)) + int64(math.MaxInt32+1)
		}
		if sumPriorities < minSumPriorities {
			minSumPriorities = sumPriorities
			lenNodes1 = 0
		}
		if sumPriorities == minSumPriorities {
			minNodes1[lenNodes1] = node
			lenNodes1++
		}
	}
	if lenNodes1 == 1 {
		return minNodes1[0]
	}

	// There are a few nodes with minimum highest priority victim and sum of priorities.
	// Find one with the minimum number of pods.
	minNumPods := math.MaxInt32
	lenNodes2 = 0
	for i := 0; i < lenNodes1; i++ {
		node := minNodes1[i]
		numPods := len(nodesToVictims[node].Pods)
		if numPods < minNumPods {
			minNumPods = numPods
			lenNodes2 = 0
		}
		if numPods == minNumPods {
			minNodes2[lenNodes2] = node
			lenNodes2++
		}
	}
	// At this point, even if there are more than one node with the same score,
	// return the first one.
	if lenNodes2 > 0 {
		return minNodes2[0]
	}
	glog.Errorf("Error in logic of node scoring for preemption. We should never reach here!")
	return nil
}
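The staged tie-breaking above is equivalent to taking the minimum of a tuple (PDB violations, highest victim priority, sum of shifted priorities, number of victims). A compact sketch of that view, assuming a hypothetical `victims` summary type and a slice instead of a map so that "first such node" is deterministic:

```go
package main

import (
	"fmt"
	"math"
)

// victims is a hypothetical summary of one node's preemption victims.
// Priorities must be sorted in decreasing order.
type victims struct {
	Node          string
	PDBViolations int
	Priorities    []int32
}

// key returns the tuple compared when ranking nodes; smaller is better.
// It must only be called when v.Priorities is non-empty.
func key(v victims) [4]int64 {
	var sum int64
	for _, p := range v.Priorities {
		// Shift every priority to be >= 0 before summing.
		sum += int64(p) + int64(math.MaxInt32) + 1
	}
	return [4]int64{int64(v.PDBViolations), int64(v.Priorities[0]), sum, int64(len(v.Priorities))}
}

// less reports whether tuple a ranks strictly before tuple b.
func less(a, b [4]int64) bool {
	for i := range a {
		if a[i] != b[i] {
			return a[i] < b[i]
		}
	}
	return false
}

// pickNode returns the best node name, or "" when candidates is empty.
func pickNode(candidates []victims) string {
	for _, c := range candidates {
		if len(c.Priorities) == 0 {
			return c.Node // no preemption needed on this node
		}
	}
	best := ""
	var bestKey [4]int64
	for _, c := range candidates {
		if k := key(c); best == "" || less(k, bestKey) {
			best, bestKey = c.Node, k
		}
	}
	return best
}

func main() {
	fmt.Println(pickNode([]victims{
		{Node: "n1", PDBViolations: 1, Priorities: []int32{5}},
		{Node: "n2", PDBViolations: 0, Priorities: []int32{10, 1}},
		{Node: "n3", PDBViolations: 0, Priorities: []int32{10}},
	}))
}
```

Here n1 loses on PDB violations, and n3 beats n2 on the sum of shifted priorities because it has fewer victims with the same highest priority.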

3.4. getLowerPriorityNominatedPods

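The basic flow of getLowerPriorityNominatedPods is as follows:

  1. Get the pods that are nominated to the candidate node and waiting in the scheduling queue.
  2. Get the priority of the pod being scheduled.
  3. Iterate over those pods and collect the ones whose priority is lower than that of the pod being scheduled.

The relevant code in genericScheduler.Preempt is as follows: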

// Lower priority pods nominated to run on this node, may no longer fit on
// this node. So, we should remove their nomination. Removing their
// nomination updates these pods and moves them to the active queue. It
// lets scheduler find another place for them.
nominatedPods := g.getLowerPriorityNominatedPods(pod, candidateNode.Name)
if nodeInfo, ok := g.cachedNodeInfoMap[candidateNode.Name]; ok {
	return nodeInfo.Node(), nodeToVictims[candidateNode].Pods, nominatedPods, err
}

The code of getLowerPriorityNominatedPods is as follows:

This code is in pkg/scheduler/core/generic_scheduler.go

// getLowerPriorityNominatedPods returns pods whose priority is smaller than the
// priority of the given "pod" and are nominated to run on the given node.
// Note: We could possibly check if the nominated lower priority pods still fit
// and return those that no longer fit, but that would require lots of
// manipulation of NodeInfo and PredicateMeta per nominated pod. It may not be
// worth the complexity, especially because we generally expect to have a very
// small number of nominated pods per node.
func (g *genericScheduler) getLowerPriorityNominatedPods(pod *v1.Pod, nodeName string) []*v1.Pod {
	pods := g.schedulingQueue.WaitingPodsForNode(nodeName)

	if len(pods) == 0 {
		return nil
	}

	var lowerPriorityPods []*v1.Pod
	podPriority := util.GetPodPriority(pod)
	for _, p := range pods {
		if util.GetPodPriority(p) < podPriority {
			lowerPriorityPods = append(lowerPriorityPods, p)
		}
	}
	return lowerPriorityPods
}

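4. Summary

4.1. Scheduler.preempt

When a pod fails to schedule, the scheduler preempts the resources of lower-priority pods to make room for the higher-priority pod. The inputs are the pod that failed to schedule and the scheduling error.

The basic preemption flow is as follows:

  1. Check whether preemption is disabled; if so, return immediately.
  2. Fetch the latest object for the pod that failed to schedule.
  3. Run the preemption algorithm Algorithm.Preempt, which returns the nominated node and the list of pods to evict.
  4. Write the node returned by the algorithm to the pod's Status.NominatedNodeName and delete the pods to be evicted.
  5. When the algorithm returns a nil node, clear the pod's Status.NominatedNodeName.

The net effect on the preempting pod is an update of Pod.Status.NominatedNodeName. If the algorithm returns a non-nil node, that node is written to Pod.Status.NominatedNodeName; otherwise Pod.Status.NominatedNodeName is cleared.

4.2. genericScheduler.Preempt

The main job of Preempt is to find a node the pod can be scheduled on, together with the pods on that node that must be evicted by preemption.

The basic flow is as follows:

  1. Based on the reasons scheduling failed, do a first pass over all nodes and select the nodes where preemption might help.
  2. Use selectNodesForPreemption to select the candidate nodes and the victim pods on each.
  3. Filter those candidates again through the extenders' preemption logic.
  4. From the filtered result, pick the one node on which the pod may eventually be scheduled through preemption.
  5. For that candidate node, find the lower-priority pods already nominated to it, whose nomination must be cleared.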

11.4.4 - kube-scheduler source code analysis (5): PrioritizeNodes

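The following code analysis is based on kubernetes v1.12.0.

This article analyzes the priority (scoring) logic, which picks the best node from the nodes that passed the predicates. It is implemented by PrioritizeNodes, which returns a list recording the score of each node.

1. Call entry point

genericScheduler.Schedule calls PrioritizeNodes as follows:

This code is in pkg/scheduler/core/generic_scheduler.go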

func (g *genericScheduler) Schedule(pod *v1.Pod, nodeLister algorithm.NodeLister) (string, error) {
  ...
	trace.Step("Prioritizing")
	startPriorityEvalTime := time.Now()
	// When only one node after predicate, just use it.
	if len(filteredNodes) == 1 {
		metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
		return filteredNodes[0].Name, nil
	}

	metaPrioritiesInterface := g.priorityMetaProducer(pod, g.cachedNodeInfoMap)
	// Run the priority logic; it returns a list recording each node's score
	priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)
	if err != nil {
		return "", err
	}
	metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
	metrics.SchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))
	...
}

Core code:

// Score the nodes that passed the predicates (filteredNodes) and return a list recording each node's score
priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)

2. PrioritizeNodes

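Prioritization selects the best node from the nodes that satisfy the predicates. PrioritizeNodes returns a list recording the score of each node.

In detail:

  • PrioritizeNodes ranks the nodes by running the individual priority functions in parallel.
  • Each priority function scores every node in the range 0-10.
  • 0 is the lowest priority (least preferred node) and 10 is the highest.
  • Each priority function also has its own weight.
  • The node score returned by a priority function is multiplied by that weight to get the weighted score.
  • Finally, all weighted scores are added up to get the total weighted score of each node.

The main flow of PrioritizeNodes is:

  1. If neither priority functions nor extenders are configured, give every node the same score and return immediately.
  2. Run the map function of each priority function on every node to produce a per-node score (nodes are processed in parallel).
  3. Run the reduce function, where one is defined, on the map results to compute the final score for that priority function.
  4. Multiply each score by the weight of its priority function and add the results up per node (a weighted sum, not a weighted average).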
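
The map, reduce and weighted-sum steps can be sketched in a few lines of sequential Go. This is only an illustration under simplifying assumptions: `hostScore`, `priority` and `prioritize` are hypothetical names, there is no concurrency and no error handling, and the types are far smaller than the real schedulerapi and algorithm types.

```go
package main

import "fmt"

// hostScore is a simplified stand-in for schedulerapi.HostPriority.
type hostScore struct {
	Host  string
	Score int
}

// priority bundles a map function, an optional reduce function and a weight,
// loosely mirroring PriorityConfig (hypothetical type for this sketch).
type priority struct {
	Map    func(node string) int
	Reduce func(scores []hostScore)
	Weight int
}

// prioritize runs map for every node, then reduce per priority function,
// and finally sums score*weight for each node.
func prioritize(nodes []string, priorities []priority) []hostScore {
	results := make([][]hostScore, len(priorities))
	for i, p := range priorities {
		results[i] = make([]hostScore, len(nodes))
		for j, n := range nodes {
			results[i][j] = hostScore{Host: n, Score: p.Map(n)}
		}
		if p.Reduce != nil {
			p.Reduce(results[i])
		}
	}
	total := make([]hostScore, len(nodes))
	for j, n := range nodes {
		total[j].Host = n
		for i, p := range priorities {
			total[j].Score += results[i][j].Score * p.Weight
		}
	}
	return total
}

func main() {
	nodes := []string{"node-a", "node-b"}
	free := map[string]int{"node-a": 3, "node-b": 8}
	priorities := []priority{
		{Map: func(n string) int { return free[n] }, Weight: 2},
		{Map: func(n string) int { return 5 }, Weight: 1},
	}
	for _, s := range prioritize(nodes, priorities) {
		fmt.Printf("%s => %d\n", s.Host, s.Score)
	}
}
```

With these inputs node-a totals 3*2 + 5*1 = 11 and node-b totals 8*2 + 5*1 = 21, which shows that the final value is a sum of weighted scores rather than an average.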

Input parameters:

  • pod
  • nodeNameToInfo
  • meta (interface{})
  • priorityConfigs
  • nodes
  • extenders

Return value:

  • HostPriorityList: a list recording the score of each node.

HostPriority is defined as follows:

// HostPriority represents the priority of scheduling to a particular host, higher priority is better.
type HostPriority struct {
	// Name of the host
	Host string
	// Score associated with the host
	Score int
}

The complete code of PrioritizeNodes is as follows:

This code is located in pkg/scheduler/core/generic_scheduler.go

// PrioritizeNodes prioritizes the nodes by running the individual priority functions in parallel.
// Each priority function is expected to set a score of 0-10
// 0 is the lowest priority score (least preferred node) and 10 is the highest
// Each priority function can also have its own weight
// The node scores returned by the priority function are multiplied by the weights to get weighted scores
// All scores are finally combined (added) to get the total weighted scores of all nodes
func PrioritizeNodes(
	pod *v1.Pod,
	nodeNameToInfo map[string]*schedulercache.NodeInfo,
	meta interface{},
	priorityConfigs []algorithm.PriorityConfig,
	nodes []*v1.Node,
	extenders []algorithm.SchedulerExtender,
) (schedulerapi.HostPriorityList, error) {
	// If no priority configs are provided, then the EqualPriority function is applied
	// This is required to generate the priority list in the required format
	if len(priorityConfigs) == 0 && len(extenders) == 0 {
		result := make(schedulerapi.HostPriorityList, 0, len(nodes))
		for i := range nodes {
			hostPriority, err := EqualPriorityMap(pod, meta, nodeNameToInfo[nodes[i].Name])
			if err != nil {
				return nil, err
			}
			result = append(result, hostPriority)
		}
		return result, nil
	}

	var (
		mu   = sync.Mutex{}
		wg   = sync.WaitGroup{}
		errs []error
	)
	appendError := func(err error) {
		mu.Lock()
		defer mu.Unlock()
		errs = append(errs, err)
	}

	results := make([]schedulerapi.HostPriorityList, len(priorityConfigs), len(priorityConfigs))

	for i, priorityConfig := range priorityConfigs {
		if priorityConfig.Function != nil {
			// DEPRECATED
			wg.Add(1)
			go func(index int, config algorithm.PriorityConfig) {
				defer wg.Done()
				var err error
				results[index], err = config.Function(pod, nodeNameToInfo, nodes)
				if err != nil {
					appendError(err)
				}
			}(i, priorityConfig)
		} else {
			results[i] = make(schedulerapi.HostPriorityList, len(nodes))
		}
	}
	processNode := func(index int) {
		nodeInfo := nodeNameToInfo[nodes[index].Name]
		var err error
		for i := range priorityConfigs {
			if priorityConfigs[i].Function != nil {
				continue
			}
			results[i][index], err = priorityConfigs[i].Map(pod, meta, nodeInfo)
			if err != nil {
				appendError(err)
				return
			}
		}
	}
	workqueue.Parallelize(16, len(nodes), processNode)
	for i, priorityConfig := range priorityConfigs {
		if priorityConfig.Reduce == nil {
			continue
		}
		wg.Add(1)
		go func(index int, config algorithm.PriorityConfig) {
			defer wg.Done()
			if err := config.Reduce(pod, meta, nodeNameToInfo, results[index]); err != nil {
				appendError(err)
			}
			if glog.V(10) {
				for _, hostPriority := range results[index] {
					glog.Infof("%v -> %v: %v, Score: (%d)", pod.Name, hostPriority.Host, config.Name, hostPriority.Score)
				}
			}
		}(i, priorityConfig)
	}
	// Wait for all computations to be finished.
	wg.Wait()
	if len(errs) != 0 {
		return schedulerapi.HostPriorityList{}, errors.NewAggregate(errs)
	}

	// Summarize all scores.
	result := make(schedulerapi.HostPriorityList, 0, len(nodes))

	for i := range nodes {
		result = append(result, schedulerapi.HostPriority{Host: nodes[i].Name, Score: 0})
		for j := range priorityConfigs {
			result[i].Score += results[j][i].Score * priorityConfigs[j].Weight
		}
	}

	if len(extenders) != 0 && nodes != nil {
		combinedScores := make(map[string]int, len(nodeNameToInfo))
		for _, extender := range extenders {
			if !extender.IsInterested(pod) {
				continue
			}
			wg.Add(1)
			go func(ext algorithm.SchedulerExtender) {
				defer wg.Done()
				prioritizedList, weight, err := ext.Prioritize(pod, nodes)
				if err != nil {
					// Prioritization errors from extender can be ignored, let k8s/other extenders determine the priorities
					return
				}
				mu.Lock()
				for i := range *prioritizedList {
					host, score := (*prioritizedList)[i].Host, (*prioritizedList)[i].Score
					combinedScores[host] += score * weight
				}
				mu.Unlock()
			}(extender)
		}
		// wait for all go routines to finish
		wg.Wait()
		for i := range result {
			result[i].Score += combinedScores[result[i].Host]
		}
	}

	if glog.V(10) {
		for i := range result {
			glog.V(10).Infof("Host %s => Score %d", result[i].Host, result[i].Score)
		}
	}
	return result, nil
}

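The following sections analyze PrioritizeNodes piece by piece.

3. EqualPriorityMap

If neither priority functions nor extenders are provided, all nodes get the same priority, i.e. every node's score is 1, and the result is returned directly. (In practice the list of priority functions is normally not empty.)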

// If no priority configs are provided, then the EqualPriority function is applied
// This is required to generate the priority list in the required format
if len(priorityConfigs) == 0 && len(extenders) == 0 {
	result := make(schedulerapi.HostPriorityList, 0, len(nodes))
	for i := range nodes {
		hostPriority, err := EqualPriorityMap(pod, meta, nodeNameToInfo[nodes[i].Name])
		if err != nil {
			return nil, err
		}
		result = append(result, hostPriority)
	}
	return result, nil
}

EqualPriorityMap is implemented as follows:

// EqualPriorityMap is a prioritizer function that gives an equal weight of one to all nodes
func EqualPriorityMap(_ *v1.Pod, _ interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
	node := nodeInfo.Node()
	if node == nil {
		return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
	}
	return schedulerapi.HostPriority{
		Host:  node.Name,
		Score: 1,
	}, nil
}

4. processNode

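processNode looks up the node's information by index and calls the map function (priorityConfigs[i].Map) of each registered priority function for that node and pod. It does not return a value; each score is written into results[i][index], where i is the index of the priority function and index is the index of the node. processNode is executed in parallel through workqueue.Parallelize, in the same way that checkNode is used in the predicate logic findNodesThatFit.

The priority functions are recorded in priorityConfigs. Each priority function consists of two functions, a PriorityMapFunction and a PriorityReduceFunction. For how priority functions are registered, see registerAlgorithmProvider.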

processNode := func(index int) {
	nodeInfo := nodeNameToInfo[nodes[index].Name]
	var err error
	for i := range priorityConfigs {
		if priorityConfigs[i].Function != nil {
			continue
		}
		results[i][index], err = priorityConfigs[i].Map(pod, meta, nodeInfo)
		if err != nil {
			appendError(err)
			return
		}
	}
}
// Run processNode in parallel
workqueue.Parallelize(16, len(nodes), processNode)
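
To make the fan-out pattern concrete, here is a minimal sketch of a bounded worker pool in the spirit of workqueue.Parallelize(16, len(nodes), processNode). The `parallelize` function is a hypothetical stand-in written for this example, not the client-go implementation.

```go
package main

import (
	"fmt"
	"sync"
)

// parallelize runs doWork(i) for every i in [0, pieces) using at most
// `workers` goroutines. It is a simplified stand-in for workqueue.Parallelize.
func parallelize(workers, pieces int, doWork func(i int)) {
	if pieces < workers {
		workers = pieces
	}
	// Pre-fill a buffered channel with all indexes, then close it so that
	// workers exit once the channel is drained.
	indexes := make(chan int, pieces)
	for i := 0; i < pieces; i++ {
		indexes <- i
	}
	close(indexes)

	var wg sync.WaitGroup
	wg.Add(workers)
	for w := 0; w < workers; w++ {
		go func() {
			defer wg.Done()
			for i := range indexes {
				doWork(i)
			}
		}()
	}
	wg.Wait()
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	scores := make([]int, len(nodes))
	// Each call writes only to its own index, so no lock is needed here.
	parallelize(16, len(nodes), func(i int) {
		scores[i] = len(nodes[i]) + i
	})
	fmt.Println(scores)
}
```

The same property holds for processNode: every invocation writes to a distinct results[i][index] slot, so only the shared error slice needs the mutex.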

PriorityConfig, the element type of priorityConfigs, is defined as follows:

Core fields:

  • Map: PriorityMapFunction
  • Reduce: PriorityReduceFunction

// PriorityConfig is a config used for a priority function.
type PriorityConfig struct {
	Name   string
	Map    PriorityMapFunction   
	Reduce PriorityReduceFunction
	// TODO: Remove it after migrating all functions to
	// Map-Reduce pattern.
	Function PriorityFunction
	Weight   int
}

The concrete priority functions are analyzed below, using NewSelectorSpreadPriority as the example.

5. PriorityMapFunction

PriorityMapFunction is a function that computes the per-node result for a given node.

PriorityMapFunction is defined as follows:

// PriorityMapFunction is a function that computes per-node results for a given node.
// TODO: Figure out the exact API of this method.
// TODO: Change interface{} to a specific type.
type PriorityMapFunction func(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error)

PriorityMapFunction is called in processNode, as follows:

results[i][index], err = priorityConfigs[i].Map(pod, meta, nodeInfo)

The map function of NewSelectorSpreadPriority, CalculateSpreadPriorityMap, is analyzed later in this article.

6. PriorityReduceFunction

PriorityReduceFunction is a function that aggregates the per-node results and computes the final scores for all nodes.

PriorityReduceFunction is defined as follows:

// PriorityReduceFunction is a function that aggregated per-node results and computes
// final scores for all nodes.
// TODO: Figure out the exact API of this method.
// TODO: Change interface{} to a specific type.
type PriorityReduceFunction func(pod *v1.Pod, meta interface{}, nodeNameToInfo map[string]*schedulercache.NodeInfo, result schedulerapi.HostPriorityList) error

The part of PrioritizeNodes that calls the reduce functions is as follows:

for i, priorityConfig := range priorityConfigs {
	if priorityConfig.Reduce == nil {
		continue
	}
	wg.Add(1)
	go func(index int, config algorithm.PriorityConfig) {
		defer wg.Done()
		if err := config.Reduce(pod, meta, nodeNameToInfo, results[index]); err != nil {
			appendError(err)
		}
		if glog.V(10) {
			for _, hostPriority := range results[index] {
				glog.Infof("%v -> %v: %v, Score: (%d)", pod.Name, hostPriority.Host, config.Name, hostPriority.Score)
			}
		}
	}(i, priorityConfig)
}

The reduce function of NewSelectorSpreadPriority, CalculateSpreadPriorityReduce, is analyzed later in this article.

7. Summarize all scores

First wait for all computations to finish, then sum up the weighted scores.

// Wait for all computations to be finished.
wg.Wait()
if len(errs) != 0 {
	return schedulerapi.HostPriorityList{}, errors.NewAggregate(errs)
}

Compute the weighted sum of the scores for every node.

// Summarize all scores.
result := make(schedulerapi.HostPriorityList, 0, len(nodes))

for i := range nodes {
	result = append(result, schedulerapi.HostPriority{Host: nodes[i].Name, Score: 0})
	for j := range priorityConfigs {
		result[i].Score += results[j][i].Score * priorityConfigs[j].Weight
	}
}

If extenders are configured, add the weighted scores returned by the extenders to each node's total.

if len(extenders) != 0 && nodes != nil {
	combinedScores := make(map[string]int, len(nodeNameToInfo))
	for _, extender := range extenders {
		if !extender.IsInterested(pod) {
			continue
		}
		wg.Add(1)
		go func(ext algorithm.SchedulerExtender) {
			defer wg.Done()
			prioritizedList, weight, err := ext.Prioritize(pod, nodes)
			if err != nil {
				// Prioritization errors from extender can be ignored, let k8s/other extenders determine the priorities
				return
			}
			mu.Lock()
			for i := range *prioritizedList {
				host, score := (*prioritizedList)[i].Host, (*prioritizedList)[i].Score
				combinedScores[host] += score * weight
			}
			mu.Unlock()
		}(extender)
	}
	// wait for all go routines to finish
	wg.Wait()
	for i := range result {
		result[i].Score += combinedScores[result[i].Host]
	}
}

8. NewSelectorSpreadPriority

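The rest of this article uses the NewSelectorSpreadPriority priority function as an example; other important priority functions will be analyzed separately.

The main purpose of NewSelectorSpreadPriority is to spread pods that belong to the same service or RS across different nodes as much as possible.

The function is registered as follows:

This code is located in pkg/scheduler/algorithmprovider/defaults/defaults.go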

// ServiceSpreadingPriority is a priority config factory that spreads pods by minimizing
// the number of pods (belonging to the same service) on the same node.
// Register the factory so that it's available, but do not include it as part of the default priorities
// Largely replaced by "SelectorSpreadPriority", but registered for backward compatibility with 1.0
factory.RegisterPriorityConfigFactory(
	"ServiceSpreadingPriority",
	factory.PriorityConfigFactory{
		MapReduceFunction: func(args factory.PluginFactoryArgs) (algorithm.PriorityMapFunction, algorithm.PriorityReduceFunction) {
			return priorities.NewSelectorSpreadPriority(args.ServiceLister, algorithm.EmptyControllerLister{}, algorithm.EmptyReplicaSetLister{}, algorithm.EmptyStatefulSetLister{})
		},
		Weight: 1,
	},
)

NewSelectorSpreadPriority is implemented as follows:

This code is located in pkg/scheduler/algorithm/priorities/selector_spreading.go

// NewSelectorSpreadPriority creates a SelectorSpread.
func NewSelectorSpreadPriority(
	serviceLister algorithm.ServiceLister,
	controllerLister algorithm.ControllerLister,
	replicaSetLister algorithm.ReplicaSetLister,
	statefulSetLister algorithm.StatefulSetLister) (algorithm.PriorityMapFunction, algorithm.PriorityReduceFunction) {
	selectorSpread := &SelectorSpread{
		serviceLister:     serviceLister,
		controllerLister:  controllerLister,
		replicaSetLister:  replicaSetLister,
		statefulSetLister: statefulSetLister,
	}
	return selectorSpread.CalculateSpreadPriorityMap, selectorSpread.CalculateSpreadPriorityReduce
}

NewSelectorSpreadPriority returns two functions, a map function and a reduce function: CalculateSpreadPriorityMap and CalculateSpreadPriorityReduce.

8.1. CalculateSpreadPriorityMap

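CalculateSpreadPriorityMap spreads pods belonging to the same service, RC, RS or StatefulSet across different nodes. When a pod is scheduled, it first looks for the services, RSs, RCs and StatefulSets that match the pod, then finds the existing pods that match those selectors, and favors the nodes that have the fewest such pods.

The basic flow is as follows:

  1. Find the selectors of the services, RSs, RCs and StatefulSets that match the pod.
  2. Iterate over all pods on the current node and use the number of existing pods matched by those selectors as the node's score. (At this stage a higher score means more matching pods and therefore a less suitable node; the reduce phase converts it to a 0-10 scale on which a higher score means a more suitable node.)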
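
The counting in step 2 can be sketched without any Kubernetes dependency. The example below assumes plain equality-based selectors modelled as `map[string]string`; the `pod` type, `matches` and `countMatchingPods` are hypothetical names used only for illustration.

```go
package main

import "fmt"

// pod is a simplified, hypothetical stand-in for v1.Pod.
type pod struct {
	Name      string
	Namespace string
	Labels    map[string]string
	Deleting  bool // stands in for DeletionTimestamp != nil
}

// matches reports whether every key/value pair of the selector is present
// in the labels (equality-based matching only).
func matches(selector, labels map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// countMatchingPods counts the pods on a node that live in the same namespace
// as the incoming pod, are not being deleted, and match at least one selector.
func countMatchingPods(incoming pod, selectors []map[string]string, nodePods []pod) int {
	count := 0
	for _, p := range nodePods {
		if p.Namespace != incoming.Namespace || p.Deleting {
			continue
		}
		for _, sel := range selectors {
			if matches(sel, p.Labels) {
				count++
				break // count each pod at most once
			}
		}
	}
	return count
}

func main() {
	web := map[string]string{"app": "web"}
	nodePods := []pod{
		{Name: "web-1", Namespace: "default", Labels: web},
		{Name: "web-2", Namespace: "default", Labels: web, Deleting: true},
		{Name: "web-3", Namespace: "other", Labels: web},
		{Name: "db-1", Namespace: "default", Labels: map[string]string{"app": "db"}},
	}
	incoming := pod{Name: "web-4", Namespace: "default", Labels: web}
	fmt.Println(countMatchingPods(incoming, []map[string]string{web}, nodePods))
}
```

Only web-1 is counted: web-2 is pending deletion, web-3 is in another namespace and db-1 does not match the selector.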

This code is located in pkg/scheduler/algorithm/priorities/selector_spreading.go

// CalculateSpreadPriorityMap spreads pods across hosts, considering pods
// belonging to the same service,RC,RS or StatefulSet.
// When a pod is scheduled, it looks for services, RCs,RSs and StatefulSets that match the pod,
// then finds existing pods that match those selectors.
// It favors nodes that have fewer existing matching pods.
// i.e. it pushes the scheduler towards a node where there's the smallest number of
// pods which match the same service, RC,RSs or StatefulSets selectors as the pod being scheduled.
func (s *SelectorSpread) CalculateSpreadPriorityMap(pod *v1.Pod, meta interface{}, nodeInfo *schedulercache.NodeInfo) (schedulerapi.HostPriority, error) {
	var selectors []labels.Selector
	node := nodeInfo.Node()
	if node == nil {
		return schedulerapi.HostPriority{}, fmt.Errorf("node not found")
	}

	priorityMeta, ok := meta.(*priorityMetadata)
	if ok {
		selectors = priorityMeta.podSelectors
	} else {
		selectors = getSelectors(pod, s.serviceLister, s.controllerLister, s.replicaSetLister, s.statefulSetLister)
	}

	if len(selectors) == 0 {
		return schedulerapi.HostPriority{
			Host:  node.Name,
			Score: int(0),
		}, nil
	}

	count := int(0)
	for _, nodePod := range nodeInfo.Pods() {
		if pod.Namespace != nodePod.Namespace {
			continue
		}
		// When we are replacing a failed pod, we often see the previous
		// deleted version while scheduling the replacement.
		// Ignore the previous deleted version for spreading purposes
		// (it can still be considered for resource restrictions etc.)
		if nodePod.DeletionTimestamp != nil {
			glog.V(4).Infof("skipping pending-deleted pod: %s/%s", nodePod.Namespace, nodePod.Name)
			continue
		}
		for _, selector := range selectors {
			if selector.Matches(labels.Set(nodePod.ObjectMeta.Labels)) {
				count++
				break
			}
		}
	}
	return schedulerapi.HostPriority{
		Host:  node.Name,
		Score: int(count),
	}, nil
}

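Analyzed piece by piece:

First obtain the selectors. They are taken from the priority metadata when it is available; otherwise they are computed with getSelectors.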

selectors = getSelectors(pod, s.serviceLister, s.controllerLister, s.replicaSetLister, s.statefulSetLister)

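Count the pods on the node that match the selectors and use the count as the node's score. This is not the node's final score, only an intermediate value that the reduce phase consumes.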

count := int(0)
for _, nodePod := range nodeInfo.Pods() {
	...
	for _, selector := range selectors {
		if selector.Matches(labels.Set(nodePod.ObjectMeta.Labels)) {
			count++
			break
		}
	}
}

8.2. CalculateSpreadPriorityReduce

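CalculateSpreadPriorityReduce computes a 0-10 score for each node from the number of existing matching pods on it. The fewer matching pods a node has, the higher its score and the more likely the pod is to be scheduled onto it.

The basic flow is as follows:

  1. Record the highest map-phase score among all nodes (that is, the largest number of matching pods found on any single node).
  2. Iterate over all nodes and compute a proportional score on the 0-10 scale: score = 10 × (maxCount - count) / maxCount, truncated to an integer, where maxCount is the largest number of matching pods on any node and count is the number on the current node. If maxCount is 0, every node keeps the maximum score of 10. A higher score means fewer matching pods on the node and therefore a higher chance of being chosen, which spreads pods with the same selector across nodes. When nodes carry zone information, this node score is additionally blended with a zone-level score computed in the same way.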
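
The node-level normalization can be checked with a small worked example. This sketch deliberately leaves out the zone weighting, and `maxPriority` is assumed to mirror schedulerapi.MaxPriority (10); `spreadReduce` is a hypothetical helper, not the scheduler's function.

```go
package main

import "fmt"

// maxPriority is assumed to mirror schedulerapi.MaxPriority.
const maxPriority = 10

// spreadReduce converts per-node matching-pod counts into 0-10 scores.
// Nodes with fewer matching pods receive higher scores.
func spreadReduce(counts []int) []int {
	maxCount := 0
	for _, c := range counts {
		if c > maxCount {
			maxCount = c
		}
	}
	scores := make([]int, len(counts))
	for i, c := range counts {
		// Default to the maximum score when no node has a matching pod.
		f := float64(maxPriority)
		if maxCount > 0 {
			f = float64(maxPriority) * (float64(maxCount-c) / float64(maxCount))
		}
		scores[i] = int(f) // truncation, as in the original code
	}
	return scores
}

func main() {
	fmt.Println(spreadReduce([]int{0, 1, 4}))
	fmt.Println(spreadReduce([]int{0, 0}))
}
```

For counts 0, 1 and 4 the largest count is 4, so the scores are 10 × 4/4 = 10, 10 × 3/4 = 7.5 truncated to 7, and 10 × 0/4 = 0. When no node has a matching pod, all nodes score 10.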

This code is located in pkg/scheduler/algorithm/priorities/selector_spreading.go

// CalculateSpreadPriorityReduce calculates the source of each node
// based on the number of existing matching pods on the node
// where zone information is included on the nodes, it favors nodes
// in zones with fewer existing matching pods.
func (s *SelectorSpread) CalculateSpreadPriorityReduce(pod *v1.Pod, meta interface{}, nodeNameToInfo map[string]*schedulercache.NodeInfo, result schedulerapi.HostPriorityList) error {
	countsByZone := make(map[string]int, 10)
	maxCountByZone := int(0)
	maxCountByNodeName := int(0)

	for i := range result {
		if result[i].Score > maxCountByNodeName {
			maxCountByNodeName = result[i].Score
		}
		zoneID := utilnode.GetZoneKey(nodeNameToInfo[result[i].Host].Node())
		if zoneID == "" {
			continue
		}
		countsByZone[zoneID] += result[i].Score
	}

	for zoneID := range countsByZone {
		if countsByZone[zoneID] > maxCountByZone {
			maxCountByZone = countsByZone[zoneID]
		}
	}

	haveZones := len(countsByZone) != 0

	maxCountByNodeNameFloat64 := float64(maxCountByNodeName)
	maxCountByZoneFloat64 := float64(maxCountByZone)
	MaxPriorityFloat64 := float64(schedulerapi.MaxPriority)

	for i := range result {
		// initializing to the default/max node score of maxPriority
		fScore := MaxPriorityFloat64
		if maxCountByNodeName > 0 {
			fScore = MaxPriorityFloat64 * (float64(maxCountByNodeName-result[i].Score) / maxCountByNodeNameFloat64)
		}
		// If there is zone information present, incorporate it
		if haveZones {
			zoneID := utilnode.GetZoneKey(nodeNameToInfo[result[i].Host].Node())
			if zoneID != "" {
				zoneScore := MaxPriorityFloat64
				if maxCountByZone > 0 {
					zoneScore = MaxPriorityFloat64 * (float64(maxCountByZone-countsByZone[zoneID]) / maxCountByZoneFloat64)
				}
				fScore = (fScore * (1.0 - zoneWeighting)) + (zoneWeighting * zoneScore)
			}
		}
		result[i].Score = int(fScore)
		if glog.V(10) {
			glog.Infof(
				"%v -> %v: SelectorSpreadPriority, Score: (%d)", pod.Name, result[i].Host, int(fScore),
			)
		}
	}
	return nil
}

Analyzed piece by piece:

First find the largest number of matching pods on any node (and accumulate the counts per zone).

for i := range result {
	if result[i].Score > maxCountByNodeName {
		maxCountByNodeName = result[i].Score
	}
	zoneID := utilnode.GetZoneKey(nodeNameToInfo[result[i].Host].Node())
	if zoneID == "" {
		continue
	}
	countsByZone[zoneID] += result[i].Score
}

Then iterate over all nodes and compute the proportional score on the 0-10 scale.

for i := range result {
	// initializing to the default/max node score of maxPriority
	fScore := MaxPriorityFloat64
	if maxCountByNodeName > 0 {
		fScore = MaxPriorityFloat64 * (float64(maxCountByNodeName-result[i].Score) / maxCountByNodeNameFloat64)
	}
  ...
}  

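9. Summary

Prioritization selects the best node from the nodes that satisfy the predicates. PrioritizeNodes returns a list recording the score of each node.

9.1. PrioritizeNodes

The main flow is:

  1. If neither priority functions nor extenders are configured, give every node the same score and return immediately.
  2. Run the map function of each priority function on every node to produce a per-node score (nodes are processed in parallel).
  3. Run the reduce function, where one is defined, on the map results to compute the final score for that priority function.
  4. Multiply each score by the weight of its priority function and add the results up per node (a weighted sum, not a weighted average).

Each priority function consists of a map function and an optional reduce function.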

9.2. NewSelectorSpreadPriority

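NewSelectorSpreadPriority was used as the example priority function. Its purpose is to spread pods belonging to the same service, RS, RC or StatefulSet across different nodes as much as possible. It consists of a map function and a reduce function, as follows.

9.2.1. CalculateSpreadPriorityMap

The basic flow is as follows:

  1. Find the selectors of the services, RSs, RCs and StatefulSets that match the pod.
  2. Iterate over all pods on the current node and use the number of existing pods matched by those selectors as the node's score. (At this stage a higher score means more matching pods and therefore a less suitable node; the reduce phase converts it to a 0-10 scale on which a higher score means a more suitable node.)

9.2.2. CalculateSpreadPriorityReduce

The basic flow is as follows:

  1. Record the highest map-phase score among all nodes (that is, the largest number of matching pods found on any single node).
  2. Iterate over all nodes and compute a proportional score on the 0-10 scale: score = 10 × (maxCount - count) / maxCount, truncated to an integer, where maxCount is the largest number of matching pods on any node and count is the number on the current node. A higher score means fewer matching pods on the node and therefore a higher chance of being chosen, which spreads pods with the same selector across nodes.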


11.4.5 -

kube-scheduler Source Code Analysis (2): registerAlgorithmProvider

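The following analysis is based on Kubernetes v1.12.0.

This part introduces the scheduling algorithms used by the scheduler, including how they are registered. The registration code is mainly in /pkg/scheduler/algorithmprovider, while the predicate and priority algorithms themselves are implemented in /pkg/scheduler/algorithm.

1. ApplyFeatureGates

The entry point that registers the scheduling algorithms is the Run function of the scheduler command.

This code is located in /cmd/kube-scheduler/app/server.go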

// Run runs the Scheduler.
func Run(c schedulerserverconfig.CompletedConfig, stopCh <-chan struct{}) error {
	...
	// Apply algorithms based on feature gates.
	// TODO: make configurable?
	algorithmprovider.ApplyFeatureGates()
  ...
}  

ApplyFeatureGates is implemented in the pkg/scheduler/algorithmprovider package.

This code is located in /pkg/scheduler/algorithmprovider/plugins.go

// ApplyFeatureGates applies algorithm by feature gates.
func ApplyFeatureGates() {
	defaults.ApplyFeatureGates()
}

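The underlying ApplyFeatureGates in the defaults package is implemented as follows:

This code is located in /pkg/scheduler/algorithmprovider/defaults/defaults.go

Depending on the enabled feature gates, it removes some scheduling policies and registers others.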

// ApplyFeatureGates applies algorithm by feature gates.
func ApplyFeatureGates() {
	if utilfeature.DefaultFeatureGate.Enabled(features.TaintNodesByCondition) {
		// Remove "CheckNodeCondition", "CheckNodeMemoryPressure", "CheckNodePIDPressurePred"
		// and "CheckNodeDiskPressure" predicates
		factory.RemoveFitPredicate(predicates.CheckNodeConditionPred)
		factory.RemoveFitPredicate(predicates.CheckNodeMemoryPressurePred)
		factory.RemoveFitPredicate(predicates.CheckNodeDiskPressurePred)
		factory.RemoveFitPredicate(predicates.CheckNodePIDPressurePred)
		// Remove key "CheckNodeCondition", "CheckNodeMemoryPressure" and "CheckNodeDiskPressure"
		// from ALL algorithm provider
		// The key will be removed from all providers which in algorithmProviderMap[]
		// if you just want remove specific provider, call func RemovePredicateKeyFromAlgoProvider()
		factory.RemovePredicateKeyFromAlgorithmProviderMap(predicates.CheckNodeConditionPred)
		factory.RemovePredicateKeyFromAlgorithmProviderMap(predicates.CheckNodeMemoryPressurePred)
		factory.RemovePredicateKeyFromAlgorithmProviderMap(predicates.CheckNodeDiskPressurePred)
		factory.RemovePredicateKeyFromAlgorithmProviderMap(predicates.CheckNodePIDPressurePred)

		// Fit is determined based on whether a pod can tolerate all of the node's taints
		factory.RegisterMandatoryFitPredicate(predicates.PodToleratesNodeTaintsPred, predicates.PodToleratesNodeTaints)
		// Fit is determined based on whether a pod can tolerate unschedulable of node
		factory.RegisterMandatoryFitPredicate(predicates.CheckNodeUnschedulablePred, predicates.CheckNodeUnschedulablePredicate)
		// Insert Key "PodToleratesNodeTaints" and "CheckNodeUnschedulable" To All Algorithm Provider
		// The key will insert to all providers which in algorithmProviderMap[]
		// if you just want insert to specific provider, call func InsertPredicateKeyToAlgoProvider()
		factory.InsertPredicateKeyToAlgorithmProviderMap(predicates.PodToleratesNodeTaintsPred)
		factory.InsertPredicateKeyToAlgorithmProviderMap(predicates.CheckNodeUnschedulablePred)

		glog.Warningf("TaintNodesByCondition is enabled, PodToleratesNodeTaints predicate is mandatory")
	}

	// Prioritizes nodes that satisfy pod's resource limits
	if utilfeature.DefaultFeatureGate.Enabled(features.ResourceLimitsPriorityFunction) {
		factory.RegisterPriorityFunction2("ResourceLimitsPriority", priorities.ResourceLimitsPriorityMap, nil, 1)
	}

}
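
The effect of the TaintNodesByCondition branch can be modelled with plain maps: the node-condition predicates are removed from the registry and from every provider, and the taint-based predicates are inserted everywhere. The sketch below is a hypothetical, much-reduced model; the `registry` type, its methods and the string keys are illustrative and are not the factory package's API.

```go
package main

import (
	"fmt"
	"sort"
)

// registry is a hypothetical model of the scheduler factory: the set of
// registered predicates plus the predicate keys of each algorithm provider.
type registry struct {
	predicates map[string]bool
	providers  map[string]map[string]bool
}

// remove drops a predicate from the registry and from every provider.
func (r *registry) remove(key string) {
	delete(r.predicates, key)
	for _, keys := range r.providers {
		delete(keys, key)
	}
}

// insertMandatory registers a predicate and adds its key to every provider.
func (r *registry) insertMandatory(key string) {
	r.predicates[key] = true
	for _, keys := range r.providers {
		keys[key] = true
	}
}

// applyFeatureGates swaps condition-based predicates for taint-based ones
// when the taintNodesByCondition flag (an assumed boolean here) is on.
func applyFeatureGates(r *registry, taintNodesByCondition bool) {
	if !taintNodesByCondition {
		return
	}
	for _, key := range []string{"CheckNodeCondition", "CheckNodeMemoryPressure", "CheckNodeDiskPressure", "CheckNodePIDPressure"} {
		r.remove(key)
	}
	for _, key := range []string{"PodToleratesNodeTaints", "CheckNodeUnschedulable"} {
		r.insertMandatory(key)
	}
}

// sortedKeys returns the keys of a set in sorted order for stable printing.
func sortedKeys(set map[string]bool) []string {
	keys := make([]string, 0, len(set))
	for k := range set {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	return keys
}

func main() {
	r := &registry{
		predicates: map[string]bool{"CheckNodeCondition": true, "GeneralPredicates": true},
		providers: map[string]map[string]bool{
			"DefaultProvider": {"CheckNodeCondition": true, "GeneralPredicates": true},
		},
	}
	applyFeatureGates(r, true)
	fmt.Println(sortedKeys(r.providers["DefaultProvider"]))
}
```

Modelling the providers as key sets makes it clear why the real code has to touch both the predicate registry and the provider map: a provider only stores names, so a key left behind would point at a predicate that no longer exists.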

2. init

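The init function runs automatically when the defaults package under algorithmprovider is initialized, that is, as soon as the package is imported and before any of its functions are called. It registers the predicate and priority algorithms.

This code is located in /pkg/scheduler/algorithmprovider/defaults/defaults.go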

func init() {
	// Register functions that extract metadata used by predicates and priorities computations.
	factory.RegisterPredicateMetadataProducerFactory(
		func(args factory.PluginFactoryArgs) algorithm.PredicateMetadataProducer {
			return predicates.NewPredicateMetadataFactory(args.PodLister)
		})
	factory.RegisterPriorityMetadataProducerFactory(
		func(args factory.PluginFactoryArgs) algorithm.PriorityMetadataProducer {
			return priorities.NewPriorityMetadataFactory(args.ServiceLister, args.ControllerLister, args.ReplicaSetLister, args.StatefulSetLister)
		})

	registerAlgorithmProvider(defaultPredicates(), defaultPriorities())

	// IMPORTANT NOTES for predicate developers:
	// We are using cached predicate result for pods belonging to the same equivalence class.
	// So when implementing a new predicate, you are expected to check whether the result
	// of your predicate function can be affected by related API object change (ADD/DELETE/UPDATE).
	// If yes, you are expected to invalidate the cached predicate result for related API object change.
	// For example:
	// https://github.com/kubernetes/kubernetes/blob/36a218e/plugin/pkg/scheduler/factory/factory.go#L422

	// Registers predicates and priorities that are not enabled by default, but user can pick when creating their
	// own set of priorities/predicates.

	// PodFitsPorts has been replaced by PodFitsHostPorts for better user understanding.
	// For backwards compatibility with 1.0, PodFitsPorts is registered as well.
	factory.RegisterFitPredicate("PodFitsPorts", predicates.PodFitsHostPorts)
	// Fit is defined based on the absence of port conflicts.
	// This predicate is actually a default predicate, because it is invoked from
	// predicates.GeneralPredicates()
	factory.RegisterFitPredicate(predicates.PodFitsHostPortsPred, predicates.PodFitsHostPorts)
	// Fit is determined by resource availability.
	// This predicate is actually a default predicate, because it is invoked from
	// predicates.GeneralPredicates()
	factory.RegisterFitPredicate(predicates.PodFitsResourcesPred, predicates.PodFitsResources)
	// Fit is determined by the presence of the Host parameter and a string match
	// This predicate is actually a default predicate, because it is invoked from
	// predicates.GeneralPredicates()
	factory.RegisterFitPredicate(predicates.HostNamePred, predicates.PodFitsHost)
	// Fit is determined by node selector query.
	factory.RegisterFitPredicate(predicates.MatchNodeSelectorPred, predicates.PodMatchNodeSelector)

	// ServiceSpreadingPriority is a priority config factory that spreads pods by minimizing
	// the number of pods (belonging to the same service) on the same node.
	// Register the factory so that it's available, but do not include it as part of the default priorities
	// Largely replaced by "SelectorSpreadPriority", but registered for backward compatibility with 1.0
	factory.RegisterPriorityConfigFactory(
		"ServiceSpreadingPriority",
		factory.PriorityConfigFactory{
			MapReduceFunction: func(args factory.PluginFactoryArgs) (algorithm.PriorityMapFunction, algorithm.PriorityReduceFunction) {
				return priorities.NewSelectorSpreadPriority(args.ServiceLister, algorithm.EmptyControllerLister{}, algorithm.EmptyReplicaSetLister{}, algorithm.EmptyStatefulSetLister{})
			},
			Weight: 1,
		},
	)
	// EqualPriority is a prioritizer function that gives an equal weight of one to all nodes
	// Register the priority function so that its available
	// but do not include it as part of the default priorities
	factory.RegisterPriorityFunction2("EqualPriority", core.EqualPriorityMap, nil, 1)
	// Optional, cluster-autoscaler friendly priority function - give used nodes higher priority.
	factory.RegisterPriorityFunction2("MostRequestedPriority", priorities.MostRequestedPriorityMap, nil, 1)
	factory.RegisterPriorityFunction2(
		"RequestedToCapacityRatioPriority",
		priorities.RequestedToCapacityRatioResourceAllocationPriorityDefault().PriorityMap,
		nil,
		1)
}

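The registrations performed in init are described separately below.

2.1. registerAlgorithmProvider

This part mainly registers the default predicate and priority policies.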

// Register functions that extract metadata used by predicates and priorities computations.
factory.RegisterPredicateMetadataProducerFactory(
	func(args factory.PluginFactoryArgs) algorithm.PredicateMetadataProducer {
		return predicates.NewPredicateMetadataFactory(args.PodLister)
	})
factory.RegisterPriorityMetadataProducerFactory(
	func(args factory.PluginFactoryArgs) algorithm.PriorityMetadataProducer {
		return priorities.NewPriorityMetadataFactory(args.ServiceLister, args.ControllerLister, args.ReplicaSetLister, args.StatefulSetLister)
	})

registerAlgorithmProvider(defaultPredicates(), defaultPriorities())

registerAlgorithmProvider

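It registers the algorithm providers, namely DefaultProvider and ClusterAutoscalerProvider. ClusterAutoscalerProvider uses the same predicates as DefaultProvider, but replaces LeastRequestedPriority with MostRequestedPriority in the priority set.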

func registerAlgorithmProvider(predSet, priSet sets.String) {
	// Registers algorithm providers. By default we use 'DefaultProvider', but user can specify one to be used
	// by specifying flag.
	factory.RegisterAlgorithmProvider(factory.DefaultProvider, predSet, priSet)
	// Cluster autoscaler friendly scheduling algorithm.
	factory.RegisterAlgorithmProvider(ClusterAutoscalerProvider, predSet,
		copyAndReplace(priSet, "LeastRequestedPriority", "MostRequestedPriority"))
}
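
The copyAndReplace call is what makes the two providers differ by exactly one priority. A minimal sketch of that idea follows, using a hand-rolled `stringSet` as a hypothetical stand-in for sets.String; it shows the intent (copy first, then swap one key) and is not the upstream code.

```go
package main

import (
	"fmt"
	"sort"
)

// stringSet is a minimal, hypothetical stand-in for sets.String.
type stringSet map[string]struct{}

func newStringSet(items ...string) stringSet {
	s := make(stringSet, len(items))
	for _, it := range items {
		s[it] = struct{}{}
	}
	return s
}

func (s stringSet) has(item string) bool {
	_, ok := s[item]
	return ok
}

// list returns the members in sorted order.
func (s stringSet) list() []string {
	out := make([]string, 0, len(s))
	for k := range s {
		out = append(out, k)
	}
	sort.Strings(out)
	return out
}

// copyAndReplace returns a copy of set in which replaceWhat, if present,
// is swapped for replaceWith. The input set is left untouched.
func copyAndReplace(set stringSet, replaceWhat, replaceWith string) stringSet {
	result := newStringSet(set.list()...)
	if result.has(replaceWhat) {
		delete(result, replaceWhat)
		result[replaceWith] = struct{}{}
	}
	return result
}

func main() {
	defaults := newStringSet("LeastRequestedPriority", "SelectorSpreadPriority")
	autoscaler := copyAndReplace(defaults, "LeastRequestedPriority", "MostRequestedPriority")
	fmt.Println(defaults.list())
	fmt.Println(autoscaler.list())
}
```

Copying before replacing matters because both providers are registered from the same priSet; mutating it in place would also change DefaultProvider.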

2.2. RegisterFitPredicate

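The init function registers the following predicate functions.

The predicates are:

Scheduling policy | Function | Description
PodFitsPorts | PodFitsHostPorts | PodFitsPorts has been replaced by PodFitsHostPorts; it is still registered only for backward compatibility.
PodFitsHostPortsPred | PodFitsHostPorts | Checks for conflicts with ports already in use on the host.
PodFitsResourcesPred | PodFitsResources | Checks whether the node has enough resources.
HostNamePred | PodFitsHost | Checks whether the node the pod explicitly requests is the current node.
MatchNodeSelectorPred | PodMatchNodeSelector | Checks whether the pod's node selector matches the current node.

The code is as follows: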

// PodFitsPorts has been replaced by PodFitsHostPorts for better user understanding.
// For backwards compatibility with 1.0, PodFitsPorts is registered as well.
factory.RegisterFitPredicate("PodFitsPorts", predicates.PodFitsHostPorts)
// Fit is defined based on the absence of port conflicts.
// This predicate is actually a default predicate, because it is invoked from
// predicates.GeneralPredicates()
factory.RegisterFitPredicate(predicates.PodFitsHostPortsPred, predicates.PodFitsHostPorts)
// Fit is determined by resource availability.
// This predicate is actually a default predicate, because it is invoked from
// predicates.GeneralPredicates()
factory.RegisterFitPredicate(predicates.PodFitsResourcesPred, predicates.PodFitsResources)
// Fit is determined by the presence of the Host parameter and a string match
// This predicate is actually a default predicate, because it is invoked from
// predicates.GeneralPredicates()
factory.RegisterFitPredicate(predicates.HostNamePred, predicates.PodFitsHost)
// Fit is determined by node selector query.
factory.RegisterFitPredicate(predicates.MatchNodeSelectorPred, predicates.PodMatchNodeSelector)

2.3. RegisterPriorityFunction2

The init function registers the following priority functions.

// EqualPriority is a prioritizer function that gives an equal weight of one to all nodes
// Register the priority function so that its available
// but do not include it as part of the default priorities
factory.RegisterPriorityFunction2("EqualPriority", core.EqualPriorityMap, nil, 1)
// Optional, cluster-autoscaler friendly priority function - give used nodes higher priority.
factory.RegisterPriorityFunction2("MostRequestedPriority", priorities.MostRequestedPriorityMap, nil, 1)
factory.RegisterPriorityFunction2(
	"RequestedToCapacityRatioPriority",
	priorities.RequestedToCapacityRatioResourceAllocationPriorityDefault().PriorityMap,
	nil,
	1)

3. defaultPredicates

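This function registers the default predicates.

The default predicates are:

Predicate | Function | Description
NoVolumeZoneConflictPred | NewVolumeZonePredicate | Checks whether the volumes used by the pod impose zone requirements that the node satisfies. Currently only PVC-backed volumes are supported.
MaxEBSVolumeCountPred | NewMaxPDVolumeCountPredicate | Checks whether the pod's EBS volumes would exceed the limit on this node.
MaxGCEPDVolumeCountPred | NewMaxPDVolumeCountPredicate | Checks whether the pod's GCE PD volumes would exceed the limit on this node.
MaxAzureDiskVolumeCountPred | NewMaxPDVolumeCountPredicate | Checks whether the pod's Azure Disk volumes would exceed the limit on this node.
MaxCSIVolumeCountPred | NewCSIMaxVolumeLimitPredicate | Checks whether the CSI volume limit would be exceeded.
MatchInterPodAffinityPred | NewPodAffinityPredicate | Checks inter-pod affinity.
NoDiskConflictPred | NoDiskConflict | Checks for conflicting disk volumes.
GeneralPred | GeneralPredicates | The general predicates.
CheckNodeMemoryPressurePred | CheckNodeMemoryPressurePredicate | Checks whether the node is under memory pressure.
CheckNodeDiskPressurePred | CheckNodeDiskPressurePredicate | Checks whether the node is under disk pressure.
CheckNodePIDPressurePred | CheckNodePIDPressurePredicate | Checks whether the node is under PID pressure.
CheckNodeConditionPred | CheckNodeConditionPredicate | Checks node conditions such as whether the node is ready.
PodToleratesNodeTaintsPred | PodToleratesNodeTaints | Checks whether the pod tolerates the node's taints.
CheckVolumeBindingPred | NewVolumeBindingPredicate | Checks volume topology requirements.

The code is as follows: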

func defaultPredicates() sets.String {
	return sets.NewString(
		// Fit is determined by volume zone requirements.
		factory.RegisterFitPredicateFactory(
			predicates.NoVolumeZoneConflictPred,
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				return predicates.NewVolumeZonePredicate(args.PVInfo, args.PVCInfo, args.StorageClassInfo)
			},
		),
		// Fit is determined by whether or not there would be too many AWS EBS volumes attached to the node
		factory.RegisterFitPredicateFactory(
			predicates.MaxEBSVolumeCountPred,
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				return predicates.NewMaxPDVolumeCountPredicate(predicates.EBSVolumeFilterType, args.PVInfo, args.PVCInfo)
			},
		),
		// Fit is determined by whether or not there would be too many GCE PD volumes attached to the node
		factory.RegisterFitPredicateFactory(
			predicates.MaxGCEPDVolumeCountPred,
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				return predicates.NewMaxPDVolumeCountPredicate(predicates.GCEPDVolumeFilterType, args.PVInfo, args.PVCInfo)
			},
		),
		// Fit is determined by whether or not there would be too many Azure Disk volumes attached to the node
		factory.RegisterFitPredicateFactory(
			predicates.MaxAzureDiskVolumeCountPred,
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				return predicates.NewMaxPDVolumeCountPredicate(predicates.AzureDiskVolumeFilterType, args.PVInfo, args.PVCInfo)
			},
		),
		factory.RegisterFitPredicateFactory(
			predicates.MaxCSIVolumeCountPred,
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				return predicates.NewCSIMaxVolumeLimitPredicate(args.PVInfo, args.PVCInfo)
			},
		),
		// Fit is determined by inter-pod affinity.
		factory.RegisterFitPredicateFactory(
			predicates.MatchInterPodAffinityPred,
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				return predicates.NewPodAffinityPredicate(args.NodeInfo, args.PodLister)
			},
		),

		// Fit is determined by non-conflicting disk volumes.
		factory.RegisterFitPredicate(predicates.NoDiskConflictPred, predicates.NoDiskConflict),

		// GeneralPredicates are the predicates that are enforced by all Kubernetes components
		// (e.g. kubelet and all schedulers)
		factory.RegisterFitPredicate(predicates.GeneralPred, predicates.GeneralPredicates),

		// Fit is determined by node memory pressure condition.
		factory.RegisterFitPredicate(predicates.CheckNodeMemoryPressurePred, predicates.CheckNodeMemoryPressurePredicate),

		// Fit is determined by node disk pressure condition.
		factory.RegisterFitPredicate(predicates.CheckNodeDiskPressurePred, predicates.CheckNodeDiskPressurePredicate),

		// Fit is determined by node pid pressure condition.
		factory.RegisterFitPredicate(predicates.CheckNodePIDPressurePred, predicates.CheckNodePIDPressurePredicate),

		// Fit is determined by node conditions: not ready, network unavailable or out of disk.
		factory.RegisterMandatoryFitPredicate(predicates.CheckNodeConditionPred, predicates.CheckNodeConditionPredicate),

		// Fit is determined based on whether a pod can tolerate all of the node's taints
		factory.RegisterFitPredicate(predicates.PodToleratesNodeTaintsPred, predicates.PodToleratesNodeTaints),

		// Fit is determined by volume topology requirements.
		factory.RegisterFitPredicateFactory(
			predicates.CheckVolumeBindingPred,
			func(args factory.PluginFactoryArgs) algorithm.FitPredicate {
				return predicates.NewVolumeBindingPredicate(args.VolumeBinder)
			},
		),
	)
}
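
The registration pattern above can be pictured with a small, self-contained model. This is only a sketch: `registry`, `registerFitPredicate`, and the two predicate names are hypothetical stand-ins, not the real `factory` package API. The point it illustrates is that each register call stores the predicate under its name and returns that name, so the caller can collect the returned names into a set.

```go
package main

import (
	"fmt"
	"sort"
)

// fitPredicate is a hypothetical, simplified predicate signature: it reports
// whether a pod requesting cpuRequest fits on a node with freeCPU available.
type fitPredicate func(cpuRequest, freeCPU int) bool

// registry maps predicate names to implementations. It is a stand-in for the
// map that a registration package would keep internally.
var registry = map[string]fitPredicate{}

// registerFitPredicate stores the predicate and returns its name, so that the
// caller can build a set of registered names.
func registerFitPredicate(name string, p fitPredicate) string {
	registry[name] = p
	return name
}

// defaultPredicateNames registers the default predicates of this sketch and
// returns their names as a set.
func defaultPredicateNames() map[string]struct{} {
	names := map[string]struct{}{}
	for _, n := range []string{
		registerFitPredicate("HasEnoughCPU", func(req, free int) bool { return req <= free }),
		registerFitPredicate("AlwaysFits", func(req, free int) bool { return true }),
	} {
		names[n] = struct{}{}
	}
	return names
}

func main() {
	names := defaultPredicateNames()
	sorted := make([]string, 0, len(names))
	for n := range names {
		sorted = append(sorted, n)
	}
	sort.Strings(sorted)
	fmt.Println(sorted)
}
```

Returning the name from the register call keeps the list of defaults and the registration itself in one place, which is the shape the real function has as well.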

4. defaultPriorities

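This part is the registration function for the default priority policies.

The default priority policies are as follows:

Priority Function Description
SelectorSpreadPriority NewSelectorSpreadPriority Spreads pods that belong to the same service or ReplicaSet across different nodes.
InterPodAffinityPriority NewInterPodAffinityPriority Based on inter-pod affinity, prefers nodes in the same topology domain as the pods that this pod has affinity with.
LeastRequestedPriority LeastRequestedPriorityMap Prioritizes nodes by least requested utilization.
BalancedResourceAllocation BalancedResourceAllocationMap Favors nodes on which resource usage stays balanced.
NodePreferAvoidPodsPriority CalculateNodePreferAvoidPodsPriorityMap Its weight (10000) is set large enough to override all other priority functions.
NodeAffinityPriority CalculateNodeAffinityPriorityMap Prefers nodes whose labels match the pod's node affinity.
TaintTolerationPriority ComputeTaintTolerationPriorityMap Prefers nodes whose taints the pod can tolerate.
ImageLocalityPriority ImageLocalityPriorityMap Scores nodes according to whether they already have the images used by the pod.

The implementation is as follows: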

func defaultPriorities() sets.String {
	return sets.NewString(
		// spreads pods by minimizing the number of pods (belonging to the same service or replication controller) on the same node.
		factory.RegisterPriorityConfigFactory(
			"SelectorSpreadPriority",
			factory.PriorityConfigFactory{
				MapReduceFunction: func(args factory.PluginFactoryArgs) (algorithm.PriorityMapFunction, algorithm.PriorityReduceFunction) {
					return priorities.NewSelectorSpreadPriority(args.ServiceLister, args.ControllerLister, args.ReplicaSetLister, args.StatefulSetLister)
				},
				Weight: 1,
			},
		),
		// pods should be placed in the same topological domain (e.g. same node, same rack, same zone, same power domain, etc.)
		// as some other pods, or, conversely, should not be placed in the same topological domain as some other pods.
		factory.RegisterPriorityConfigFactory(
			"InterPodAffinityPriority",
			factory.PriorityConfigFactory{
				Function: func(args factory.PluginFactoryArgs) algorithm.PriorityFunction {
					return priorities.NewInterPodAffinityPriority(args.NodeInfo, args.NodeLister, args.PodLister, args.HardPodAffinitySymmetricWeight)
				},
				Weight: 1,
			},
		),

		// Prioritize nodes by least requested utilization.
		factory.RegisterPriorityFunction2("LeastRequestedPriority", priorities.LeastRequestedPriorityMap, nil, 1),

		// Prioritizes nodes to help achieve balanced resource usage
		factory.RegisterPriorityFunction2("BalancedResourceAllocation", priorities.BalancedResourceAllocationMap, nil, 1),

		// Set this weight large enough to override all other priority functions.
		// TODO: Figure out a better way to do this, maybe at same time as fixing #24720.
		factory.RegisterPriorityFunction2("NodePreferAvoidPodsPriority", priorities.CalculateNodePreferAvoidPodsPriorityMap, nil, 10000),

		// Prioritizes nodes that have labels matching NodeAffinity
		factory.RegisterPriorityFunction2("NodeAffinityPriority", priorities.CalculateNodeAffinityPriorityMap, priorities.CalculateNodeAffinityPriorityReduce, 1),

		// Prioritizes nodes that marked with taint which pod can tolerate.
		factory.RegisterPriorityFunction2("TaintTolerationPriority", priorities.ComputeTaintTolerationPriorityMap, priorities.ComputeTaintTolerationPriorityReduce, 1),

		// ImageLocalityPriority prioritizes nodes that have images requested by the pod present.
		factory.RegisterPriorityFunction2("ImageLocalityPriority", priorities.ImageLocalityPriorityMap, nil, 1),
	)
}
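
Several of the priorities above are registered with both a map function and a reduce function. The following is a minimal sketch of that split, under the assumption that the reduce step rescales raw scores into the 0-10 range. The types and the label-matching rule are hypothetical and much simpler than the real priority functions.

```go
package main

import "fmt"

// maxPriority is the top of the score range used in this sketch.
const maxPriority = 10

// hostScore is a hypothetical pair of node name and score.
type hostScore struct {
	Host  string
	Score int
}

// mapScore is the "map" step: it scores one node in isolation. The raw score
// is the number of preferred labels that the node carries.
func mapScore(host string, nodeLabels, preferred []string) hostScore {
	have := map[string]bool{}
	for _, l := range nodeLabels {
		have[l] = true
	}
	score := 0
	for _, p := range preferred {
		if have[p] {
			score++
		}
	}
	return hostScore{Host: host, Score: score}
}

// reduceScores is the "reduce" step: it sees all nodes at once and rescales
// the raw scores so that the best node gets maxPriority.
func reduceScores(scores []hostScore) {
	highest := 0
	for _, s := range scores {
		if s.Score > highest {
			highest = s.Score
		}
	}
	if highest == 0 {
		return // nothing matched anywhere; leave all scores at 0
	}
	for i := range scores {
		scores[i].Score = scores[i].Score * maxPriority / highest
	}
}

func main() {
	preferred := []string{"ssd", "zone-a"}
	scores := []hostScore{
		mapScore("node-1", []string{"ssd", "zone-a"}, preferred),
		mapScore("node-2", []string{"ssd"}, preferred),
		mapScore("node-3", nil, preferred),
	}
	reduceScores(scores)
	fmt.Println(scores)
}
```

Priorities registered with a nil reduce function, such as LeastRequestedPriority above, only need the per-node map step.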

11.4.6 -

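kube-scheduler source code analysis (3): scheduleOne

The following analysis is based on Kubernetes v1.12.0.

This article analyzes the basic scheduling flow in /pkg/scheduler/. The predicate logic, the priority logic, and the node preemption logic will be analyzed separately in later articles.

The pkg code directory structure of the scheduler is as follows: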

scheduler
├── algorithm         # scheduling algorithms
│   ├── predicates    # predicate policies
│   ├── priorities    # priority policies
│   ├── scheduler_interface.go    # definitions of the ScheduleAlgorithm and SchedulerExtender interfaces
│   ├── types.go      # definitions of the types used
├── algorithmprovider
│   ├── defaults
│   │   ├── defaults.go    # initialization of the default algorithms, including predicate and priority policies
├── cache      # cache used by the scheduler
│   ├── cache.go    # schedulerCache
│   ├── interface.go
│   ├── node_info.go
│   ├── node_tree.go
├── core       # core scheduling logic
│   ├── equivalence
│   │   ├── eqivalence.go       # caches scheduling results for equivalent pods, mainly used by the predicates
│   ├── extender.go
│   ├── generic_scheduler.go    # genericScheduler, the scheduling logic of the default scheduler
│   ├── scheduling_queue.go     # scheduling queue, which holds the pods waiting to be scheduled
├── factory
│   ├── factory.go   # NewConfigFactory and NewPodInformer; watches pod events to update the scheduling queue
├── metrics
│   └── metrics.go   # metrics for Prometheus
├── scheduler.go # Run entry point of the pkg part (core code): Run, scheduleOne, schedule, preempt, etc.
└── volumebinder
    └── volume_binder.go   # volume binding

1. Scheduler.Run

This code is located in pkg/scheduler/scheduler.go.

This is the entry point of the scheduling logic.

// Run begins watching and scheduling. It waits for cache to be synced, then starts a goroutine and returns immediately.
func (sched *Scheduler) Run() {
	if !sched.config.WaitForCacheSync() {
		return
	}

	go wait.Until(sched.scheduleOne, 0, sched.config.StopEverything)
}
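
The call `wait.Until(sched.scheduleOne, 0, ...)` runs scheduleOne back to back until the stop channel is closed. The sketch below models that loop with the standard library only. `until` and `scheduleAll` are hypothetical helpers written for this illustration; they are not the apimachinery implementation.

```go
package main

import "fmt"

// until is a simplified stand-in for wait.Until with a zero period: it keeps
// calling f until stop is closed.
func until(f func(), stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		f()
	}
}

// scheduleAll starts the loop in a goroutine, handles one pod per call (as
// scheduleOne does), and stops once every pod has been handled.
func scheduleAll(pods []string) []string {
	queue := make(chan string, len(pods))
	for _, p := range pods {
		queue <- p
	}
	stop := make(chan struct{})
	if len(pods) == 0 {
		close(stop) // nothing to do; let the loop exit immediately
	}

	var handled []string
	done := make(chan struct{})
	go func() {
		defer close(done)
		until(func() {
			handled = append(handled, <-queue)
			if len(handled) == len(pods) {
				close(stop)
			}
		}, stop)
	}()
	<-done
	return handled
}

func main() {
	fmt.Println(scheduleAll([]string{"default/web-0", "default/web-1"}))
}
```

Because each call handles exactly one pod, the scheduling decisions are serialized even though the loop itself runs in the background.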

2. Scheduler.scheduleOne

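This code is located in pkg/scheduler/scheduler.go.

scheduleOne selects a suitable node for a single pod and is the core function of the scheduling logic.

The basic flow of scheduling a single pod is as follows:

  1. Pop the next pod to be scheduled from the pending queue (podQueue).
  2. Use the scheduling algorithm to select a suitable node for the pod; the algorithm consists of two steps, predicates and priorities.
  3. If scheduling fails, try preemption: evict lower-priority pods so that the higher-priority pod can be scheduled.
  4. Assume the pod on the selected node (an optimistic binding) and store it in the scheduler cache, so that the actual binding can run asynchronously.
  5. Perform the actual binding, which records the node name in the pod's node-related field.

The complete code is as follows: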

// scheduleOne does the entire scheduling workflow for a single pod.  It is serialized on the scheduling algorithm's host fitting.
func (sched *Scheduler) scheduleOne() {
	pod := sched.config.NextPod()
	if pod.DeletionTimestamp != nil {
		sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
		glog.V(3).Infof("Skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
		return
	}

	glog.V(3).Infof("Attempting to schedule pod: %v/%v", pod.Namespace, pod.Name)

	// Synchronously attempt to find a fit for the pod.
	start := time.Now()
	suggestedHost, err := sched.schedule(pod)
	if err != nil {
		// schedule() may have failed because the pod would not fit on any host, so we try to
		// preempt, with the expectation that the next time the pod is tried for scheduling it
		// will fit due to the preemption. It is also possible that a different pod will schedule
		// into the resources that were preempted, but this is harmless.
		if fitError, ok := err.(*core.FitError); ok {
			preemptionStartTime := time.Now()
			sched.preempt(pod, fitError)
			metrics.PreemptionAttempts.Inc()
			metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
			metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
		}
		return
	}
	metrics.SchedulingAlgorithmLatency.Observe(metrics.SinceInMicroseconds(start))
	// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
	// This allows us to keep scheduling without waiting on binding to occur.
	assumedPod := pod.DeepCopy()

	// Assume volumes first before assuming the pod.
	//
	// If all volumes are completely bound, then allBound is true and binding will be skipped.
	//
	// Otherwise, binding of volumes is started after the pod is assumed, but before pod binding.
	//
	// This function modifies 'assumedPod' if volume binding is required.
	allBound, err := sched.assumeVolumes(assumedPod, suggestedHost)
	if err != nil {
		return
	}

	// assume modifies `assumedPod` by setting NodeName=suggestedHost
	err = sched.assume(assumedPod, suggestedHost)
	if err != nil {
		return
	}
	// bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
	go func() {
		// Bind volumes first before Pod
		if !allBound {
			err = sched.bindVolumes(assumedPod)
			if err != nil {
				return
			}
		}

		err := sched.bind(assumedPod, &v1.Binding{
			ObjectMeta: metav1.ObjectMeta{Namespace: assumedPod.Namespace, Name: assumedPod.Name, UID: assumedPod.UID},
			Target: v1.ObjectReference{
				Kind: "Node",
				Name: suggestedHost,
			},
		})
		metrics.E2eSchedulingLatency.Observe(metrics.SinceInMicroseconds(start))
		if err != nil {
			glog.Errorf("Internal error binding pod: (%v)", err)
		}
	}()
}

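The important parts of the code are analyzed one by one below.

3. config.NextPod

The pods waiting to be scheduled are stored in a podQueue, and NextPod takes out the next pod to be scheduled.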

pod := sched.config.NextPod()
if pod.DeletionTimestamp != nil {
	sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
	glog.V(3).Infof("Skip schedule deleting pod: %v/%v", pod.Namespace, pod.Name)
	return
}

glog.V(3).Infof("Attempting to schedule pod: %v/%v", pod.Namespace, pod.Name)

The function behind NextPod is defined in the CreateFromKeys function in factory.go, as follows:

func (c *configFactory) CreateFromKeys(predicateKeys, priorityKeys sets.String, extenders []algorithm.SchedulerExtender) (*scheduler.Config, error) {
	...
	return &scheduler.Config{
		...
		NextPod: func() *v1.Pod {
			return c.getNextPod()
		},
		...
	}, nil
}

3.1. getNextPod

A podQueue holds the pods that need to be scheduled, and the next pod to be scheduled is taken from it with Pop.

func (c *configFactory) getNextPod() *v1.Pod {
	pod, err := c.podQueue.Pop()
	if err == nil {
		glog.V(4).Infof("About to try and schedule pod %v/%v", pod.Namespace, pod.Name)
		return pod
	}
	glog.Errorf("Error while retrieving next pod from scheduling queue: %v", err)
	return nil
}
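
To make the Pop semantics concrete, here is a minimal sketch of a blocking FIFO queue, assuming that Pop waits until a pod is available. `podQueue` here is a hypothetical toy type that stores pod names as strings; the real scheduling queue is considerably richer.

```go
package main

import (
	"fmt"
	"sync"
)

// podQueue is a toy blocking FIFO queue of pod names.
type podQueue struct {
	mu    sync.Mutex
	cond  *sync.Cond
	items []string
}

func newPodQueue() *podQueue {
	q := &podQueue{}
	q.cond = sync.NewCond(&q.mu)
	return q
}

// Add appends a pod and wakes up one waiting Pop call.
func (q *podQueue) Add(pod string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.items = append(q.items, pod)
	q.cond.Signal()
}

// Pop blocks until the queue is non-empty, then removes and returns the
// oldest pod.
func (q *podQueue) Pop() string {
	q.mu.Lock()
	defer q.mu.Unlock()
	for len(q.items) == 0 {
		q.cond.Wait()
	}
	pod := q.items[0]
	q.items = q.items[1:]
	return pod
}

func main() {
	q := newPodQueue()
	go q.Add("default/nginx") // a producer, e.g. an event handler
	fmt.Println(q.Pop())      // the consumer blocks until the pod arrives
}
```

A blocking Pop is what lets the scheduling loop run with a zero period without spinning: the loop simply waits inside Pop while the queue is empty.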

4. Scheduler.schedule

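This code is located in pkg/scheduler/scheduler.go.

This is the core of the scheduling logic: it uses the configured algorithm to select the most suitable node for the pod.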

// Synchronously attempt to find a fit for the pod.
start := time.Now()
suggestedHost, err := sched.schedule(pod)
if err != nil {
	// schedule() may have failed because the pod would not fit on any host, so we try to
	// preempt, with the expectation that the next time the pod is tried for scheduling it
	// will fit due to the preemption. It is also possible that a different pod will schedule
	// into the resources that were preempted, but this is harmless.
	if fitError, ok := err.(*core.FitError); ok {
		preemptionStartTime := time.Now()
		sched.preempt(pod, fitError)
		metrics.PreemptionAttempts.Inc()
		metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
		metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
	}
	return
}

schedule runs the scheduling algorithm and returns the best node.

// schedule implements the scheduling algorithm and returns the suggested host.
func (sched *Scheduler) schedule(pod *v1.Pod) (string, error) {
	host, err := sched.config.Algorithm.Schedule(pod, sched.config.NodeLister)
	if err != nil {
		pod = pod.DeepCopy()
		sched.config.Error(pod, err)
		sched.config.Recorder.Eventf(pod, v1.EventTypeWarning, "FailedScheduling", "%v", err)
		sched.config.PodConditionUpdater.Update(pod, &v1.PodCondition{
			Type:    v1.PodScheduled,
			Status:  v1.ConditionFalse,
			Reason:  v1.PodReasonUnschedulable,
			Message: err.Error(),
		})
		return "", err
	}
	return host, err
}

4.1. ScheduleAlgorithm

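ScheduleAlgorithm is the interface of a scheduling algorithm. Its main implementation is genericScheduler; genericScheduler.Schedule is analyzed below.

The ScheduleAlgorithm interface is defined as follows: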

// ScheduleAlgorithm is an interface implemented by things that know how to schedule pods
// onto machines.
type ScheduleAlgorithm interface {
	Schedule(*v1.Pod, NodeLister) (selectedMachine string, err error)
	// Preempt receives scheduling errors for a pod and tries to create room for
	// the pod by preempting lower priority pods if possible.
	// It returns the node where preemption happened, a list of preempted pods, a
	// list of pods whose nominated node name should be removed, and error if any.
	Preempt(*v1.Pod, NodeLister, error) (selectedNode *v1.Node, preemptedPods []*v1.Pod, cleanupNominatedPods []*v1.Pod, err error)
	// Predicates() returns a pointer to a map of predicate functions. This is
	// exposed for testing.
	Predicates() map[string]FitPredicate
	// Prioritizers returns a slice of priority config. This is exposed for
	// testing.
	Prioritizers() []PriorityConfig
}

5. genericScheduler.Schedule

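This code is located in /pkg/scheduler/core/generic_scheduler.go.

genericScheduler.Schedule implements the basic scheduling logic. Given the pod to be scheduled and a list of nodes, it returns the name of the selected node on success, or an error with the reasons on failure. Scheduling is done in two steps: predicates and priorities.

The basic flow is as follows:

  1. Run basic checks on the pod, currently mainly PVC checks.
  2. Use the findNodesThatFit predicate step to select the nodes that satisfy the scheduling conditions.
  3. Use the PrioritizeNodes priority step to score the nodes that passed the predicates.
  4. Select one of the highest-scoring nodes as the node to schedule onto.

The complete code is as follows: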

// Schedule tries to schedule the given pod to one of the nodes in the node list.
// If it succeeds, it will return the name of the node.
// If it fails, it will return a FitError error with reasons.
func (g *genericScheduler) Schedule(pod *v1.Pod, nodeLister algorithm.NodeLister) (string, error) {
	trace := utiltrace.New(fmt.Sprintf("Scheduling %s/%s", pod.Namespace, pod.Name))
	defer trace.LogIfLong(100 * time.Millisecond)

	if err := podPassesBasicChecks(pod, g.pvcLister); err != nil {
		return "", err
	}

	nodes, err := nodeLister.List()
	if err != nil {
		return "", err
	}
	if len(nodes) == 0 {
		return "", ErrNoNodesAvailable
	}

	// Used for all fit and priority funcs.
	err = g.cache.UpdateNodeNameToInfoMap(g.cachedNodeInfoMap)
	if err != nil {
		return "", err
	}

	trace.Step("Computing predicates")
	startPredicateEvalTime := time.Now()
	filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
	if err != nil {
		return "", err
	}

	if len(filteredNodes) == 0 {
		return "", &FitError{
			Pod:              pod,
			NumAllNodes:      len(nodes),
			FailedPredicates: failedPredicateMap,
		}
	}
	metrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))
	metrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))

	trace.Step("Prioritizing")
	startPriorityEvalTime := time.Now()
	// When only one node after predicate, just use it.
	if len(filteredNodes) == 1 {
		metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
		return filteredNodes[0].Name, nil
	}

	metaPrioritiesInterface := g.priorityMetaProducer(pod, g.cachedNodeInfoMap)
	priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)
	if err != nil {
		return "", err
	}
	metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
	metrics.SchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))

	trace.Step("Selecting host")
	return g.selectHost(priorityList)
}
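
Stripped of caching, metrics, and tracing, the function above is a filter, score, select pipeline. The following sketch shows that shape under heavy simplification: `node`, `pod`, and the CPU-only fit and scoring rules are hypothetical, and ties go to the first node rather than being rotated.

```go
package main

import (
	"errors"
	"fmt"
)

type node struct {
	Name    string
	FreeCPU int
}

type pod struct {
	Name string
	CPU  int
}

var (
	errNoNodes = errors.New("no nodes available")
	errNoFit   = errors.New("pod does not fit on any node")
)

// findNodesThatFit is the predicate step: keep only nodes with enough CPU.
func findNodesThatFit(p pod, nodes []node) []node {
	var fit []node
	for _, n := range nodes {
		if n.FreeCPU >= p.CPU {
			fit = append(fit, n)
		}
	}
	return fit
}

// scoreNode is the priority step: more CPU left after placement scores
// higher, capped at 10.
func scoreNode(p pod, n node) int {
	left := n.FreeCPU - p.CPU
	if left > 10 {
		left = 10
	}
	return left
}

// schedule filters, scores, and returns the name of the best node.
func schedule(p pod, nodes []node) (string, error) {
	if len(nodes) == 0 {
		return "", errNoNodes
	}
	fit := findNodesThatFit(p, nodes)
	if len(fit) == 0 {
		return "", errNoFit
	}
	if len(fit) == 1 {
		return fit[0].Name, nil // only one candidate: no need to score
	}
	best, bestScore := fit[0], scoreNode(p, fit[0])
	for _, n := range fit[1:] {
		if s := scoreNode(p, n); s > bestScore {
			best, bestScore = n, s
		}
	}
	return best.Name, nil
}

func main() {
	nodes := []node{{"node-1", 1}, {"node-2", 4}, {"node-3", 8}}
	fmt.Println(schedule(pod{"web", 2}, nodes))
	fmt.Println(schedule(pod{"big", 9}, nodes))
}
```

Keeping the "no node fits" case as a distinct error matters, because the caller uses the error type to decide whether preemption is worth trying.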

5.1. podPassesBasicChecks

podPassesBasicChecks performs basic sanity checks, currently mainly PVC checks.

if err := podPassesBasicChecks(pod, g.pvcLister); err != nil {
	return "", err
}

podPassesBasicChecks is implemented as follows:

// podPassesBasicChecks makes sanity checks on the pod if it can be scheduled.
func podPassesBasicChecks(pod *v1.Pod, pvcLister corelisters.PersistentVolumeClaimLister) error {
	// Check PVCs used by the pod
	namespace := pod.Namespace
	manifest := &(pod.Spec)
	for i := range manifest.Volumes {
		volume := &manifest.Volumes[i]
		if volume.PersistentVolumeClaim == nil {
			// Volume is not a PVC, ignore
			continue
		}
		pvcName := volume.PersistentVolumeClaim.ClaimName
		pvc, err := pvcLister.PersistentVolumeClaims(namespace).Get(pvcName)
		if err != nil {
			// The error has already enough context ("persistentvolumeclaim "myclaim" not found")
			return err
		}

		if pvc.DeletionTimestamp != nil {
			return fmt.Errorf("persistentvolumeclaim %q is being deleted", pvc.Name)
		}
	}

	return nil
}

5.2. findNodesThatFit

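Predicates: the predicate functions decide, for each node, whether the pod can be scheduled onto it.

The implementation details of findNodesThatFit will be analyzed in a separate article.

genericScheduler.Schedule calls findNodesThatFit as follows: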

trace.Step("Computing predicates")
startPredicateEvalTime := time.Now()
filteredNodes, failedPredicateMap, err := g.findNodesThatFit(pod, nodes)
if err != nil {
	return "", err
}

if len(filteredNodes) == 0 {
	return "", &FitError{
		Pod:              pod,
		NumAllNodes:      len(nodes),
		FailedPredicates: failedPredicateMap,
	}
}
metrics.SchedulingAlgorithmPredicateEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPredicateEvalTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PredicateEvaluation).Observe(metrics.SinceInSeconds(startPredicateEvalTime))

5.3. PrioritizeNodes

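Priorities: select the best node from the nodes that passed the predicates.

The procedure is as follows:

  • PrioritizeNodes ranks the nodes by running the individual priority functions in parallel.
  • Each priority function gives every node a score from 0 to 10.
  • 0 is the lowest priority and 10 is the highest.
  • Each priority function also has its own weight.
  • The score returned by a priority function is multiplied by its weight to get the weighted score.
  • Finally, the weighted scores are added up to get the total weighted score of each node.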
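
The weighted sum described in the list can be written down in a few lines. This is a sketch only: the `priority` type and the two example priorities are hypothetical, and the functions run sequentially here rather than in parallel.

```go
package main

import "fmt"

// priority is a hypothetical priority function with its weight.
// score must return a value in the range 0-10.
type priority struct {
	name   string
	weight int
	score  func(node string) int
}

// prioritize returns, for every node, the sum over all priorities of
// weight * score.
func prioritize(nodes []string, ps []priority) map[string]int {
	total := make(map[string]int, len(nodes))
	for _, n := range nodes {
		for _, p := range ps {
			total[n] += p.weight * p.score(n)
		}
	}
	return total
}

func main() {
	free := map[string]int{"node-1": 8, "node-2": 3}
	hasImage := map[string]bool{"node-2": true}

	ps := []priority{
		{name: "MoreFreeCapacity", weight: 1, score: func(n string) int { return free[n] }},
		{name: "ImageAlreadyPresent", weight: 2, score: func(n string) int {
			if hasImage[n] {
				return 10
			}
			return 0
		}},
	}
	fmt.Println(prioritize([]string{"node-1", "node-2"}, ps))
}
```

With these numbers node-1 totals 1*8 + 2*0 = 8 and node-2 totals 1*3 + 2*10 = 23, so a higher weight lets one priority outvote another.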

The implementation details of PrioritizeNodes will be analyzed in a separate article.

genericScheduler.Schedule calls PrioritizeNodes as follows:

trace.Step("Prioritizing")
startPriorityEvalTime := time.Now()
// When only one node after predicate, just use it.
if len(filteredNodes) == 1 {
	metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
	return filteredNodes[0].Name, nil
}
metaPrioritiesInterface := g.priorityMetaProducer(pod, g.cachedNodeInfoMap)
priorityList, err := PrioritizeNodes(pod, g.cachedNodeInfoMap, metaPrioritiesInterface, g.prioritizers, filteredNodes, g.extenders)
if err != nil {
	return "", err
}
metrics.SchedulingAlgorithmPriorityEvaluationDuration.Observe(metrics.SinceInMicroseconds(startPriorityEvalTime))
metrics.SchedulingLatency.WithLabelValues(metrics.PriorityEvaluation).Observe(metrics.SinceInSeconds(startPriorityEvalTime))

5.4. selectHost

Finally, the scheduler selects one of the highest-scoring nodes from priorityList.

trace.Step("Selecting host")
return g.selectHost(priorityList)

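selectHost takes the prioritized list of nodes and picks one node, in round-robin fashion, from the nodes that have the highest score.

The code is as follows: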

// selectHost takes a prioritized list of nodes and then picks one
// in a round-robin manner from the nodes that had the highest score.
func (g *genericScheduler) selectHost(priorityList schedulerapi.HostPriorityList) (string, error) {
	if len(priorityList) == 0 {
		return "", fmt.Errorf("empty priorityList")
	}

	maxScores := findMaxScores(priorityList)
	ix := int(g.lastNodeIndex % uint64(len(maxScores)))
	g.lastNodeIndex++

	return priorityList[maxScores[ix]].Host, nil
}

5.4.1. findMaxScores

findMaxScores returns the indexes of the nodes in priorityList that have the highest Score.

// findMaxScores returns the indexes of nodes in the "priorityList" that has the highest "Score".
func findMaxScores(priorityList schedulerapi.HostPriorityList) []int {
	maxScoreIndexes := make([]int, 0, len(priorityList)/2)
	maxScore := priorityList[0].Score
	for i, hp := range priorityList {
		if hp.Score > maxScore {
			maxScore = hp.Score
			maxScoreIndexes = maxScoreIndexes[:0]
			maxScoreIndexes = append(maxScoreIndexes, i)
		} else if hp.Score == maxScore {
			maxScoreIndexes = append(maxScoreIndexes, i)
		}
	}
	return maxScoreIndexes
}
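
To see the round-robin behaviour on ties, the two functions can be exercised in isolation. The sketch below is a trimmed copy that uses a hypothetical `hostPriority` type and a small `selector` struct in place of schedulerapi.HostPriorityList and genericScheduler.

```go
package main

import "fmt"

// hostPriority is a stand-in for one entry of the priority list.
type hostPriority struct {
	Host  string
	Score int
}

// selector keeps the rotating index, like lastNodeIndex in genericScheduler.
type selector struct {
	lastNodeIndex uint64
}

// findMaxScores returns the indexes of all entries with the highest score.
func findMaxScores(list []hostPriority) []int {
	idx := make([]int, 0, len(list)/2)
	maxScore := list[0].Score
	for i, hp := range list {
		if hp.Score > maxScore {
			maxScore = hp.Score
			idx = append(idx[:0], i) // new maximum: start over
		} else if hp.Score == maxScore {
			idx = append(idx, i)
		}
	}
	return idx
}

// selectHost rotates through the top-scoring nodes on successive calls.
func (s *selector) selectHost(list []hostPriority) (string, error) {
	if len(list) == 0 {
		return "", fmt.Errorf("empty priorityList")
	}
	maxScores := findMaxScores(list)
	ix := int(s.lastNodeIndex % uint64(len(maxScores)))
	s.lastNodeIndex++
	return list[maxScores[ix]].Host, nil
}

func main() {
	s := &selector{}
	list := []hostPriority{{"node-1", 5}, {"node-2", 9}, {"node-3", 9}}
	for i := 0; i < 3; i++ {
		host, _ := s.selectHost(list)
		fmt.Println(host)
	}
}
```

With two nodes tied at the top score, successive calls alternate between node-2 and node-3, which spreads pods across equally good nodes instead of always picking the first one.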

6. Scheduler.preempt

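If the pod fails the predicate and priority scheduling steps, preemption is attempted. Preemption frees up the resources held by lower-priority pods for the higher-priority pod that is waiting to be scheduled.

The implementation details of Scheduler.preempt will be analyzed in a separate article.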

suggestedHost, err := sched.schedule(pod)
if err != nil {
	// schedule() may have failed because the pod would not fit on any host, so we try to
	// preempt, with the expectation that the next time the pod is tried for scheduling it
	// will fit due to the preemption. It is also possible that a different pod will schedule
	// into the resources that were preempted, but this is harmless.
	if fitError, ok := err.(*core.FitError); ok {
		preemptionStartTime := time.Now()
		sched.preempt(pod, fitError)
		metrics.PreemptionAttempts.Inc()
		metrics.SchedulingAlgorithmPremptionEvaluationDuration.Observe(metrics.SinceInMicroseconds(preemptionStartTime))
		metrics.SchedulingLatency.WithLabelValues(metrics.PreemptionEvaluation).Observe(metrics.SinceInSeconds(preemptionStartTime))
	}
	return
}
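
Note that preemption is only attempted when the error is a *core.FitError, that is, when the pod did not fit on any node. The type assertion can be illustrated in isolation; `fitError` and `shouldPreempt` below are hypothetical names used only for this sketch.

```go
package main

import (
	"errors"
	"fmt"
)

// fitError is a stand-in for core.FitError: the pod fit on none of the nodes.
type fitError struct {
	pod         string
	numAllNodes int
}

func (e *fitError) Error() string {
	return fmt.Sprintf("0/%d nodes are available for pod %s", e.numAllNodes, e.pod)
}

// shouldPreempt mirrors the type assertion in scheduleOne: only a fit error
// makes preemption worthwhile; any other error just ends this attempt.
func shouldPreempt(err error) bool {
	_, ok := err.(*fitError)
	return ok
}

func main() {
	fmt.Println(shouldPreempt(&fitError{pod: "default/web", numAllNodes: 3}))
	fmt.Println(shouldPreempt(errors.New("failed to list nodes")))
}
```

Evicting pods cannot help with an unrelated failure such as a listing error, so only the fit error leads to a preemption attempt.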

7. Scheduler.assume

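The pod is assumed on the selected node (an optimistic binding) and stored in the scheduler cache. This lets scheduling continue without waiting for the binding to happen; the actual binding can run asynchronously.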

// Tell the cache to assume that a pod now is running on a given node, even though it hasn't been bound yet.
// This allows us to keep scheduling without waiting on binding to occur.
assumedPod := pod.DeepCopy()

// Assume volumes first before assuming the pod.
//
// If all volumes are completely bound, then allBound is true and binding will be skipped.
//
// Otherwise, binding of volumes is started after the pod is assumed, but before pod binding.
//
// This function modifies 'assumedPod' if volume binding is required.
allBound, err := sched.assumeVolumes(assumedPod, suggestedHost)
if err != nil {
	return
}

// assume modifies `assumedPod` by setting NodeName=suggestedHost
err = sched.assume(assumedPod, suggestedHost)
if err != nil {
	return
}

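The scheduler optimistically assumes that the binding will succeed and sends the bind request to the apiserver in the background. If the binding fails, the scheduler immediately releases the resources allocated to the assumed pod.

The assume method is implemented as follows: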

// assume signals to the cache that a pod is already in the cache, so that binding can be asynchronous.
// assume modifies `assumed`.
func (sched *Scheduler) assume(assumed *v1.Pod, host string) error {
	// Optimistically assume that the binding will succeed and send it to apiserver
	// in the background.
	// If the binding fails, scheduler will release resources allocated to assumed pod
	// immediately.
	assumed.Spec.NodeName = host
	// NOTE: Because the scheduler uses snapshots of SchedulerCache and the live
	// version of Ecache, updates must be written to SchedulerCache before
	// invalidating Ecache.
	if err := sched.config.SchedulerCache.AssumePod(assumed); err != nil {
		glog.Errorf("scheduler cache AssumePod failed: %v", err)

		// This is most probably result of a BUG in retrying logic.
		// We report an error here so that pod scheduling can be retried.
		// This relies on the fact that Error will check if the pod has been bound
		// to a node and if so will not add it back to the unscheduled pods queue
		// (otherwise this would cause an infinite loop).
		sched.config.Error(assumed, err)
		sched.config.Recorder.Eventf(assumed, v1.EventTypeWarning, "FailedScheduling", "AssumePod failed: %v", err)
		sched.config.PodConditionUpdater.Update(assumed, &v1.PodCondition{
			Type:    v1.PodScheduled,
			Status:  v1.ConditionFalse,
			Reason:  "SchedulerError",
			Message: err.Error(),
		})
		return err
	}

	// Optimistically assume that the binding will succeed, so we need to invalidate affected
	// predicates in equivalence cache.
	// If the binding fails, these invalidated item will not break anything.
	if sched.config.Ecache != nil {
		sched.config.Ecache.InvalidateCachedPredicateItemForPodAdd(assumed, host)
	}
	return nil
}

8. Scheduler.bind

The pod is bound to the selected node asynchronously.

// bind the pod to its host asynchronously (we can do this b/c of the assumption step above).
go func() {
	// Bind volumes first before Pod
	if !allBound {
		err = sched.bindVolumes(assumedPod)
		if err != nil {
			return
		}
	}
	err := sched.bind(assumedPod, &v1.Binding{
		ObjectMeta: metav1.ObjectMeta{Namespace: assumedPod.Namespace, Name: assumedPod.Name, UID: assumedPod.UID},
		Target: v1.ObjectReference{
			Kind: "Node",
			Name: suggestedHost,
		},
	})
	metrics.E2eSchedulingLatency.Observe(metrics.SinceInMicroseconds(start))
	if err != nil {
		glog.Errorf("Internal error binding pod: (%v)", err)
	}
}()

bind is implemented as follows:

// bind binds a pod to a given node defined in a binding object.  We expect this to run asynchronously, so we
// handle binding metrics internally.
func (sched *Scheduler) bind(assumed *v1.Pod, b *v1.Binding) error {
	bindingStart := time.Now()
	// If binding succeeded then PodScheduled condition will be updated in apiserver so that
	// it's atomic with setting host.
	err := sched.config.GetBinder(assumed).Bind(b)
	if err := sched.config.SchedulerCache.FinishBinding(assumed); err != nil {
		glog.Errorf("scheduler cache FinishBinding failed: %v", err)
	}
	if err != nil {
		glog.V(1).Infof("Failed to bind pod: %v/%v", assumed.Namespace, assumed.Name)
		if err := sched.config.SchedulerCache.ForgetPod(assumed); err != nil {
			glog.Errorf("scheduler cache ForgetPod failed: %v", err)
		}
		sched.config.Error(assumed, err)
		sched.config.Recorder.Eventf(assumed, v1.EventTypeWarning, "FailedScheduling", "Binding rejected: %v", err)
		sched.config.PodConditionUpdater.Update(assumed, &v1.PodCondition{
			Type:   v1.PodScheduled,
			Status: v1.ConditionFalse,
			Reason: "BindingRejected",
		})
		return err
	}

	metrics.BindingLatency.Observe(metrics.SinceInMicroseconds(bindingStart))
	metrics.SchedulingLatency.WithLabelValues(metrics.Binding).Observe(metrics.SinceInSeconds(bindingStart))
	sched.config.Recorder.Eventf(assumed, v1.EventTypeNormal, "Scheduled", "Successfully assigned %v/%v to %v", assumed.Namespace, assumed.Name, b.Target.Name)
	return nil
}
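
The interplay between assume and bind can be summarized as "reserve first, roll back if the binding fails". The sketch below models this with a toy cache. `schedulerCache`, `assumePod`, `forgetPod`, and `bindOrForget` are hypothetical simplifications; the toy cache has no locking and tracks a single resource.

```go
package main

import (
	"errors"
	"fmt"
)

var errBindingRejected = errors.New("binding rejected")

// reservation records where an assumed pod was placed and what it reserved.
type reservation struct {
	node string
	cpu  int
}

// schedulerCache is a toy cache tracking free CPU per node and assumed pods.
type schedulerCache struct {
	freeCPU map[string]int
	assumed map[string]reservation
}

func newSchedulerCache(freeCPU map[string]int) *schedulerCache {
	return &schedulerCache{freeCPU: freeCPU, assumed: map[string]reservation{}}
}

// assumePod reserves resources on the node before the binding is confirmed.
func (c *schedulerCache) assumePod(pod, node string, cpu int) error {
	if _, ok := c.assumed[pod]; ok {
		return fmt.Errorf("pod %s is already assumed", pod)
	}
	c.assumed[pod] = reservation{node: node, cpu: cpu}
	c.freeCPU[node] -= cpu
	return nil
}

// forgetPod releases the resources of an assumed pod.
func (c *schedulerCache) forgetPod(pod string) error {
	r, ok := c.assumed[pod]
	if !ok {
		return fmt.Errorf("pod %s is not assumed", pod)
	}
	delete(c.assumed, pod)
	c.freeCPU[r.node] += r.cpu
	return nil
}

// bindOrForget runs the bind call and rolls the reservation back on failure.
func bindOrForget(c *schedulerCache, pod string, bind func() error) error {
	if err := bind(); err != nil {
		if ferr := c.forgetPod(pod); ferr != nil {
			fmt.Println("forget failed:", ferr)
		}
		return err
	}
	return nil
}

func main() {
	c := newSchedulerCache(map[string]int{"node-1": 4})
	_ = c.assumePod("default/web", "node-1", 3)
	fmt.Println("after assume:", c.freeCPU["node-1"])
	err := bindOrForget(c, "default/web", func() error { return errBindingRejected })
	fmt.Println("after failed bind:", c.freeCPU["node-1"], err)
}
```

Reserving before binding is what allows the next pod to be scheduled against up-to-date node capacity while the previous binding is still in flight.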

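9. Summary

This article analyzed how a single pod is scheduled. The flow is as follows:

  1. Pop the next pod to be scheduled from the pending queue (podQueue).
  2. Use the scheduling algorithm to select a suitable node for the pod; the algorithm consists of two steps, predicates and priorities.
  3. If scheduling fails, try preemption: evict lower-priority pods so that the higher-priority pod can be scheduled.
  4. Assume the pod on the selected node (an optimistic binding) and store it in the scheduler cache, so that the actual binding can run asynchronously.
  5. Perform the actual binding, which records the node name in the pod's node-related field.

The core part is selecting the node through the scheduling algorithm, that is, the implementation of genericScheduler.Schedule, which consists of the predicate and priority steps.

The basic flow of genericScheduler.Schedule is as follows:

  1. Run basic checks on the pod, currently mainly PVC checks.
  2. Use the findNodesThatFit predicate step to select the nodes that satisfy the scheduling conditions.
  3. Use the PrioritizeNodes priority step to score the nodes that passed the predicates.
  4. Select one of the highest-scoring nodes as the node to schedule onto.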

11.5 -

11.5.1 -

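kubelet source code analysis (1): NewKubeletCommand

The following analysis is based on Kubernetes v1.12.0.

This article analyzes the code under https://github.com/kubernetes/kubernetes/tree/v1.12.0/cmd/kubelet .

The kubernetes/cmd/kubelet part mainly covers kubelet flag parsing and the initialization and construction of the dependent components (mostly held in the kubeDeps struct). It does not contain the detailed runtime logic of the kubelet, which lives in the kubernetes/pkg/kubelet module and will be analyzed in later articles.

The directory structure of the kubelet cmd code is as follows: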

kubelet
├── app
│   ├── auth.go
│   ├── init_others.go
│   ├── init_windows.go
│   ├── options              # options used by the kubelet
│   │   ├── container_runtime.go
│   │   ├── globalflags.go
│   │   ├── globalflags_linux.go
│   │   ├── globalflags_other.go
│   │   ├── options.go     # KubeletFlags, AddFlags, AddKubeletConfigFlags, etc.
│   │   ├── osflags_others.go
│   │   └── osflags_windows.go
│   ├── plugins.go
│   ├── server.go # NewKubeletCommand, Run, RunKubelet, CreateAndInitKubelet, startKubelet, etc.
│   ├── server_linux.go
│   └── server_unsupported.go
└── kubelet.go              # main entry point of the kubelet

1. The main function

The kubelet entry point is the main function. For the code, see https://github.com/kubernetes/kubernetes/blob/v1.12.0/cmd/kubelet/kubelet.go

func main() {
	rand.Seed(time.Now().UTC().UnixNano())

	command := app.NewKubeletCommand(server.SetupSignalHandler())
	logs.InitLogs()
	defer logs.FlushLogs()

	if err := command.Execute(); err != nil {
		fmt.Fprintf(os.Stderr, "%v\n", err)
		os.Exit(1)
	}
}

The kubelet code is built on the Cobra command-line framework. The core code is as follows:

// initialize the command
command := app.NewKubeletCommand(server.SetupSignalHandler())
// run Execute
err := command.Execute()

2. NewKubeletCommand

NewKubeletCommand creates a *cobra.Command object from its parameters. The core parts of the code are the flag parsing and the Run function.

// NewKubeletCommand creates a *cobra.Command object with default parameters
func NewKubeletCommand(stopCh <-chan struct{}) *cobra.Command {
	...
	cmd := &cobra.Command{
		Use: componentKubelet,
		Long: `...`,
		// The Kubelet has special flag parsing requirements to enforce flag precedence rules,
		// so we do all our parsing manually in Run, below.
		// DisableFlagParsing=true provides the full set of flags passed to the kubelet in the
		// `args` arg to Run, without Cobra's interference.
		DisableFlagParsing: true,
		Run: func(cmd *cobra.Command, args []string) {
			...
			// run the kubelet
			glog.V(5).Infof("KubeletConfiguration: %#v", kubeletServer.KubeletConfiguration)
			if err := Run(kubeletServer, kubeletDeps, stopCh); err != nil {
				glog.Fatal(err)
			}
		},
	}
	...
	return cmd
}

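2.1. Flag parsing

The kubelet sets DisableFlagParsing, so it does not use Cobra's default flag parsing and parses its flags itself.

2.1.1. Initializing flags and configuration

Initialize flag parsing by creating cleanFlagSet, kubeletFlags, and kubeletConfig.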

cleanFlagSet := pflag.NewFlagSet(componentKubelet, pflag.ContinueOnError)
cleanFlagSet.SetNormalizeFunc(flag.WordSepNormalizeFunc)
kubeletFlags := options.NewKubeletFlags()
kubeletConfig, err := options.NewKubeletConfiguration()

2.1.2. Printing help and version information

If an invalid argument is given, the usage information is printed.

// initial flag parse, since we disable cobra's flag parsing
if err := cleanFlagSet.Parse(args); err != nil {
	cmd.Usage()
	glog.Fatal(err)
}

// check if there are non-flag arguments in the command line
cmds := cleanFlagSet.Args()
if len(cmds) > 0 {
	cmd.Usage()
	glog.Fatalf("unknown command: %s", cmds[0])
}
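
The same "parse by hand, then reject leftover arguments" idea can be tried with the standard library. This is a sketch that swaps in Go's `flag` package for pflag; `parseArgs` and the `port` flag are hypothetical, and errors are returned instead of terminating the process.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// parseArgs parses args with its own FlagSet and returns an error for
// unknown flags and for non-flag (positional) arguments.
func parseArgs(args []string) (int, error) {
	fs := flag.NewFlagSet("kubelet-sketch", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // keep the sketch quiet; the caller reports errors
	port := fs.Int("port", 10250, "port to listen on (hypothetical flag)")

	// ContinueOnError makes Parse return the error instead of exiting.
	if err := fs.Parse(args); err != nil {
		return 0, err
	}
	// Anything left over after the flags is a non-flag argument.
	if rest := fs.Args(); len(rest) > 0 {
		return 0, fmt.Errorf("unknown command: %s", rest[0])
	}
	return *port, nil
}

func main() {
	fmt.Println(parseArgs([]string{"--port=1234"}))
	fmt.Println(parseArgs([]string{"serve"}))
}
```

Using ContinueOnError leaves the decision about what to do on a bad command line (print usage, exit, or carry on) with the caller, which is what the snippet above does with cmd.Usage and glog.Fatal.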

If the help or version flag is given, the corresponding information is printed and the kubelet exits.

// short-circuit on help
help, err := cleanFlagSet.GetBool("help")
if err != nil {
	glog.Fatal(`"help" flag is non-bool, programmer error, please correct`)
}
if help {
	cmd.Help()
	return
}

// short-circuit on verflag
verflag.PrintAndExitIfRequested()
utilflag.PrintFlags(cleanFlagSet)

2.1.3. kubelet config

Load and validate the kubelet config. This includes validating the initial kubeletFlags and loading the kubelet config content from the file named by the KubeletConfigFile field of kubeletFlags.

// set feature gates from initial flags-based config
if err := utilfeature.DefaultFeatureGate.SetFromMap(kubeletConfig.FeatureGates); err != nil {
	glog.Fatal(err)
}

// validate the initial KubeletFlags
if err := options.ValidateKubeletFlags(kubeletFlags); err != nil {
	glog.Fatal(err)
}

if kubeletFlags.ContainerRuntime == "remote" && cleanFlagSet.Changed("pod-infra-container-image") {
	glog.Warning("Warning: For remote container runtime, --pod-infra-container-image is ignored in kubelet, which should be set in that remote runtime instead")
}

// load kubelet config file, if provided
if configFile := kubeletFlags.KubeletConfigFile; len(configFile) > 0 {
	kubeletConfig, err = loadConfigFile(configFile)
	if err != nil {
		glog.Fatal(err)
	}
	// We must enforce flag precedence by re-parsing the command line into the new object.
	// This is necessary to preserve backwards-compatibility across binary upgrades.
	// See issue #56171 for more details.
	if err := kubeletConfigFlagPrecedence(kubeletConfig, args); err != nil {
		glog.Fatal(err)
	}
	// update feature gates based on new config
	if err := utilfeature.DefaultFeatureGate.SetFromMap(kubeletConfig.FeatureGates); err != nil {
		glog.Fatal(err)
	}
}

// We always validate the local configuration (command line + config file).
// This is the default "last-known-good" config for dynamic config, and must always remain valid.
if err := kubeletconfigvalidation.ValidateKubeletConfiguration(kubeletConfig); err != nil {
	glog.Fatal(err)
}
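
The flag-precedence step deserves a small illustration: values from the config file replace the defaults, and the command line is then parsed again on top of the loaded config, so that explicitly passed flags win. The sketch below uses the standard `flag` package; `config`, `applyFlags`, and `effectiveConfig` are hypothetical, and the real kubeletConfigFlagPrecedence handles more cases.

```go
package main

import (
	"flag"
	"fmt"
	"io"
)

// config is a tiny stand-in for KubeletConfiguration.
type config struct {
	Port    int
	MaxPods int
}

// applyFlags binds flags directly to the fields of cfg and parses args.
// The current field values act as defaults, so only flags that actually
// appear in args change anything.
func applyFlags(cfg *config, args []string) error {
	fs := flag.NewFlagSet("sketch", flag.ContinueOnError)
	fs.SetOutput(io.Discard)
	fs.IntVar(&cfg.Port, "port", cfg.Port, "listen port (hypothetical flag)")
	fs.IntVar(&cfg.MaxPods, "max-pods", cfg.MaxPods, "pod limit (hypothetical flag)")
	return fs.Parse(args)
}

// effectiveConfig starts from the config loaded from the file and re-parses
// the command line into it, so command-line flags take precedence.
func effectiveConfig(fromFile config, args []string) (config, error) {
	cfg := fromFile
	if err := applyFlags(&cfg, args); err != nil {
		return config{}, err
	}
	return cfg, nil
}

func main() {
	fromFile := config{Port: 10250, MaxPods: 110}
	cfg, err := effectiveConfig(fromFile, []string{"--max-pods=50"})
	fmt.Println(cfg, err)
}
```

In this run the port keeps the value from the file while max-pods takes the value from the command line, which is the precedence rule the comments above refer to.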

2.1.4. dynamic kubelet config

If dynamic kubelet config is enabled, the dynamic config replaces the kubelet config loaded from the local config file.

// use dynamic kubelet config, if enabled
var kubeletConfigController *dynamickubeletconfig.Controller
if dynamicConfigDir := kubeletFlags.DynamicConfigDir.Value(); len(dynamicConfigDir) > 0 {
	var dynamicKubeletConfig *kubeletconfiginternal.KubeletConfiguration
	dynamicKubeletConfig, kubeletConfigController, err = BootstrapKubeletConfigController(dynamicConfigDir,
		func(kc *kubeletconfiginternal.KubeletConfiguration) error {
			// Here, we enforce flag precedence inside the controller, prior to the controller's validation sequence,
			// so that we get a complete validation at the same point where we can decide to reject dynamic config.
			// This fixes the flag-precedence component of issue #63305.
			// See issue #56171 for general details on flag precedence.
			return kubeletConfigFlagPrecedence(kc, args)
		})
	if err != nil {
		glog.Fatal(err)
	}
	// If we should just use our existing, local config, the controller will return a nil config
	if dynamicKubeletConfig != nil {
		kubeletConfig = dynamicKubeletConfig
		// Note: flag precedence was already enforced in the controller, prior to validation,
		// by our above transform function. Now we simply update feature gates from the new config.
		if err := utilfeature.DefaultFeatureGate.SetFromMap(kubeletConfig.FeatureGates); err != nil {
			glog.Fatal(err)
		}
	}
}
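The "nil means keep the local config" contract, together with the transform callback, can be sketched as follows. All names here (`Config`, `bootstrapDynamic`, `effectiveConfig`) and the validation rule are assumptions made for illustration; they are not the controller's real API.

```go
package main

import "fmt"

// Config is a hypothetical stand-in for KubeletConfiguration.
type Config struct {
	MaxPods int
	Port    int
}

// transformFunc mutates a candidate config before it is validated,
// for example to re-apply command-line flag precedence.
type transformFunc func(*Config) error

// bootstrapDynamic returns the config to use, or nil when the caller
// should keep its local config. The validation rule (MaxPods > 0) is
// invented for this sketch.
func bootstrapDynamic(candidate *Config, transform transformFunc) (*Config, error) {
	if candidate == nil {
		return nil, nil // no dynamic config available
	}
	c := *candidate
	if err := transform(&c); err != nil {
		return nil, err
	}
	if c.MaxPods <= 0 {
		return nil, nil // rejected: fall back to the local config
	}
	return &c, nil
}

// effectiveConfig mirrors the caller side: use the dynamic config only
// when a non-nil one is returned.
func effectiveConfig(local Config, candidate *Config, transform transformFunc) (Config, error) {
	dynamic, err := bootstrapDynamic(candidate, transform)
	if err != nil {
		return Config{}, err
	}
	if dynamic != nil {
		return *dynamic, nil
	}
	return local, nil
}

func main() {
	local := Config{MaxPods: 110, Port: 10250}
	pinPort := func(c *Config) error { c.Port = 10255; return nil }

	got, _ := effectiveConfig(local, &Config{MaxPods: 200, Port: 1}, pinPort)
	fmt.Printf("dynamic accepted: %+v\n", got)

	got, _ = effectiveConfig(local, nil, pinPort)
	fmt.Printf("no dynamic config: %+v\n", got)
}
```

Running the transform before validation means the validated object is exactly the one that will be used, which is the point made in the comments of the original code.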

Summary: the steps above parse the various flags and produce two key objects, kubeletFlags and kubeletConfig, which are used to construct kubeletServer and the other dependencies.

2.2. Initializing kubeletServer and kubeletDeps

2.2.1. kubeletServer

// construct a KubeletServer from kubeletFlags and kubeletConfig
kubeletServer := &options.KubeletServer{
	KubeletFlags:         *kubeletFlags,
	KubeletConfiguration: *kubeletConfig,
}

2.2.2. kubeletDeps

// use kubeletServer to construct the default KubeletDeps
kubeletDeps, err := UnsecuredDependencies(kubeletServer)
if err != nil {
	glog.Fatal(err)
}

// add the kubelet config controller to kubeletDeps
kubeletDeps.KubeletConfigController = kubeletConfigController

2.2.3. docker shim

If the experimental docker shim flag is enabled, RunDockershim is executed and the command then returns.

// start the experimental docker shim, if enabled
if kubeletServer.KubeletFlags.ExperimentalDockershim {
	if err := RunDockershim(&kubeletServer.KubeletFlags, kubeletConfig, stopCh); err != nil {
		glog.Fatal(err)
	}
	return
}

2.3. AddFlags

// keep cleanFlagSet separate, so Cobra doesn't pollute it with the global flags
kubeletFlags.AddFlags(cleanFlagSet)
options.AddKubeletConfigFlags(cleanFlagSet, kubeletConfig)
options.AddGlobalFlags(cleanFlagSet)
cleanFlagSet.BoolP("help", "h", false, fmt.Sprintf("help for %s", cmd.Name()))

// ugly, but necessary, because Cobra's default UsageFunc and HelpFunc pollute the flagset with global flags
const usageFmt = "Usage:\n  %s\n\nFlags:\n%s"
cmd.SetUsageFunc(func(cmd *cobra.Command) error {
	fmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine(), cleanFlagSet.FlagUsagesWrapped(2))
	return nil
})
cmd.SetHelpFunc(func(cmd *cobra.Command, args []string) {
	fmt.Fprintf(cmd.OutOrStdout(), "%s\n\n"+usageFmt, cmd.Long, cmd.UseLine(), cleanFlagSet.FlagUsagesWrapped(2))
})

2.4. Running the kubelet

Run the kubelet without exiting. The Run function is the entry point for the remaining steps.

// run the kubelet
glog.V(5).Infof("KubeletConfiguration: %#v", kubeletServer.KubeletConfiguration)
if err := Run(kubeletServer, kubeletDeps, stopCh); err != nil {
	glog.Fatal(err)
}

3. Run

// Run runs the specified KubeletServer with the given Dependencies. This should never exit.
// The kubeDeps argument may be nil - if so, it is initialized from the settings on KubeletServer.
// Otherwise, the caller is assumed to have set up the Dependencies object and a default one will
// not be generated.
func Run(s *options.KubeletServer, kubeDeps *kubelet.Dependencies, stopCh <-chan struct{}) error {
	// To help debugging, immediately log version
	glog.Infof("Version: %+v", version.Get())
	if err := initForOS(s.KubeletFlags.WindowsService); err != nil {
		return fmt.Errorf("failed OS init: %v", err)
	}
	if err := run(s, kubeDeps, stopCh); err != nil {
		return fmt.Errorf("failed to run Kubelet: %v", err)
	}
	return nil
}

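Run first calls initForOS, an OS-specific initialization hook that receives the WindowsService flag; on other platforms it is essentially a placeholder that does nothing. The real work is done by the run(s, kubeDeps, stopCh) function.

3.1. Constructing kubeDeps

3.1.1. clientConfig

Create clientConfig, which is used to build the various clients held in kubeDeps.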

clientConfig, err := createAPIServerClientConfig(s)
if err != nil {
	return fmt.Errorf("invalid kubeconfig: %v", err)
}

3.1.2. kubeClient

kubeClient, err = clientset.NewForConfig(clientConfig)
if err != nil {
	glog.Warningf("New kubeClient from clientConfig error: %v", err)
} else if kubeClient.CertificatesV1beta1() != nil && clientCertificateManager != nil {
	glog.V(2).Info("Starting client certificate rotation.")
	clientCertificateManager.SetCertificateSigningRequestClient(kubeClient.CertificatesV1beta1().CertificateSigningRequests())
	clientCertificateManager.Start()
}

3.1.3. dynamicKubeClient

dynamicKubeClient, err = dynamic.NewForConfig(clientConfig)
if err != nil {
	glog.Warningf("Failed to initialize dynamic KubeClient: %v", err)
}

3.1.4. eventClient

// make a separate client for events
eventClientConfig := *clientConfig
eventClientConfig.QPS = float32(s.EventRecordQPS)
eventClientConfig.Burst = int(s.EventBurst)
eventClient, err = v1core.NewForConfig(&eventClientConfig)
if err != nil {
	glog.Warningf("Failed to create API Server client for Events: %v", err)
}

3.1.5. heartbeatClient

// make a separate client for heartbeat with throttling disabled and a timeout attached
heartbeatClientConfig := *clientConfig
heartbeatClientConfig.Timeout = s.KubeletConfiguration.NodeStatusUpdateFrequency.Duration
// if the NodeLease feature is enabled, the timeout is the minimum of the lease duration and status update frequency
if utilfeature.DefaultFeatureGate.Enabled(features.NodeLease) {
	leaseTimeout := time.Duration(s.KubeletConfiguration.NodeLeaseDurationSeconds) * time.Second
	if heartbeatClientConfig.Timeout > leaseTimeout {
		heartbeatClientConfig.Timeout = leaseTimeout
	}
}
heartbeatClientConfig.QPS = float32(-1)
heartbeatClient, err = clientset.NewForConfig(&heartbeatClientConfig)
if err != nil {
	glog.Warningf("Failed to create API Server client for heartbeat: %v", err)
}
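The event and heartbeat clients both start from a value copy of clientConfig and then adjust a few fields. The sketch below reproduces the heartbeat timeout rule (the smaller of the status update frequency and the lease duration) with a hypothetical `ClientConfig` struct standing in for the real client configuration type.

```go
package main

import (
	"fmt"
	"time"
)

// ClientConfig is a hypothetical, much-reduced stand-in for the real
// client configuration struct.
type ClientConfig struct {
	Host    string
	QPS     float32
	Burst   int
	Timeout time.Duration
}

// heartbeatConfig derives a heartbeat client config from the base config.
// Assigning a struct copies it, so the base config is left untouched.
func heartbeatConfig(base ClientConfig, statusUpdateFreq time.Duration, leaseSeconds int32, leaseEnabled bool) ClientConfig {
	hb := base // value copy, like heartbeatClientConfig := *clientConfig
	hb.Timeout = statusUpdateFreq
	if leaseEnabled {
		// With node leases enabled, the timeout is the minimum of the
		// lease duration and the status update frequency.
		leaseTimeout := time.Duration(leaseSeconds) * time.Second
		if hb.Timeout > leaseTimeout {
			hb.Timeout = leaseTimeout
		}
	}
	hb.QPS = -1 // the original sets QPS to -1 to disable throttling
	return hb
}

func main() {
	base := ClientConfig{Host: "https://apiserver.example", QPS: 5, Burst: 10}
	hb := heartbeatConfig(base, 10*time.Second, 4, true)
	fmt.Println("heartbeat timeout:", hb.Timeout, "qps:", hb.QPS)
	fmt.Println("base qps unchanged:", base.QPS)
}
```

Copying the struct by value is what lets each client carry its own QPS, burst, and timeout without affecting the shared clientConfig.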

3.1.6. csiClient

// csiClient works with CRDs that support json only
clientConfig.ContentType = "application/json"
csiClient, err := csiclientset.NewForConfig(clientConfig)
if err != nil {
	glog.Warningf("Failed to create CSI API client: %v", err)
}

Assigning the clients to kubeDeps

kubeDeps.KubeClient = kubeClient
kubeDeps.DynamicKubeClient = dynamicKubeClient
if heartbeatClient != nil {
	kubeDeps.HeartbeatClient = heartbeatClient
	kubeDeps.OnHeartbeatFailure = closeAllConns
}
if eventClient != nil {
	kubeDeps.EventClient = eventClient
}
kubeDeps.CSIClient = csiClient

3.1.7. CAdvisorInterface

if kubeDeps.CAdvisorInterface == nil {
	imageFsInfoProvider := cadvisor.NewImageFsInfoProvider(s.ContainerRuntime, s.RemoteRuntimeEndpoint)
	kubeDeps.CAdvisorInterface, err = cadvisor.New(imageFsInfoProvider, s.RootDirectory, cadvisor.UsingLegacyCadvisorStats(s.ContainerRuntime, s.RemoteRuntimeEndpoint))
	if err != nil {
		return err
	}
}

3.1.8. ContainerManager

if kubeDeps.ContainerManager == nil {
	if s.CgroupsPerQOS && s.CgroupRoot == "" {
		glog.Infof("--cgroups-per-qos enabled, but --cgroup-root was not specified.  defaulting to /")
		s.CgroupRoot = "/"
	}
	kubeReserved, err := parseResourceList(s.KubeReserved)
	if err != nil {
		return err
	}
	systemReserved, err := parseResourceList(s.SystemReserved)
	if err != nil {
		return err
	}
	var hardEvictionThresholds []evictionapi.Threshold
	// If the user requested to ignore eviction thresholds, then do not set valid values for hardEvictionThresholds here.
	if !s.ExperimentalNodeAllocatableIgnoreEvictionThreshold {
		hardEvictionThresholds, err = eviction.ParseThresholdConfig([]string{}, s.EvictionHard, nil, nil, nil)
		if err != nil {
			return err
		}
	}
	experimentalQOSReserved, err := cm.ParseQOSReserved(s.QOSReserved)
	if err != nil {
		return err
	}

	devicePluginEnabled := utilfeature.DefaultFeatureGate.Enabled(features.DevicePlugins)

	kubeDeps.ContainerManager, err = cm.NewContainerManager(
		kubeDeps.Mounter,
		kubeDeps.CAdvisorInterface,
		cm.NodeConfig{
			RuntimeCgroupsName:    s.RuntimeCgroups,
			SystemCgroupsName:     s.SystemCgroups,
			KubeletCgroupsName:    s.KubeletCgroups,
			ContainerRuntime:      s.ContainerRuntime,
			CgroupsPerQOS:         s.CgroupsPerQOS,
			CgroupRoot:            s.CgroupRoot,
			CgroupDriver:          s.CgroupDriver,
			KubeletRootDir:        s.RootDirectory,
			ProtectKernelDefaults: s.ProtectKernelDefaults,
			NodeAllocatableConfig: cm.NodeAllocatableConfig{
				KubeReservedCgroupName:   s.KubeReservedCgroup,
				SystemReservedCgroupName: s.SystemReservedCgroup,
				EnforceNodeAllocatable:   sets.NewString(s.EnforceNodeAllocatable...),
				KubeReserved:             kubeReserved,
				SystemReserved:           systemReserved,
				HardEvictionThresholds:   hardEvictionThresholds,
			},
			QOSReserved:                           *experimentalQOSReserved,
			ExperimentalCPUManagerPolicy:          s.CPUManagerPolicy,
			ExperimentalCPUManagerReconcilePeriod: s.CPUManagerReconcilePeriod.Duration,
			ExperimentalPodPidsLimit:              s.PodPidsLimit,
			EnforceCPULimits:                      s.CPUCFSQuota,
			CPUCFSQuotaPeriod:                     s.CPUCFSQuotaPeriod.Duration,
		},
		s.FailSwapOn,
		devicePluginEnabled,
		kubeDeps.Recorder)

	if err != nil {
		return err
	}
}

3.1.9. oomAdjuster

// TODO(vmarmol): Do this through container config.
oomAdjuster := kubeDeps.OOMAdjuster
if err := oomAdjuster.ApplyOOMScoreAdj(0, int(s.OOMScoreAdj)); err != nil {
	glog.Warning(err)
}

3.2. Health check

if s.HealthzPort > 0 {
	healthz.DefaultHealthz()
	go wait.Until(func() {
		err := http.ListenAndServe(net.JoinHostPort(s.HealthzBindAddress, strconv.Itoa(int(s.HealthzPort))), nil)
		if err != nil {
			glog.Errorf("Starting health server failed: %v", err)
		}
	}, 5*time.Second, wait.NeverStop)
}
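The health server is wrapped in wait.Until so that it is restarted five seconds after ListenAndServe returns with an error. The retry loop itself can be sketched with the standard library alone; `until` and `runAttempts` below are simplified, hypothetical helpers and not the real wait package.

```go
package main

import (
	"fmt"
	"time"
)

// until calls f, waits for period, and repeats until stop is closed.
// It is a simplified stand-in for wait.Until.
func until(f func(), period time.Duration, stop <-chan struct{}) {
	for {
		select {
		case <-stop:
			return
		default:
		}
		f()
		select {
		case <-stop:
			return
		case <-time.After(period):
		}
	}
}

// runAttempts simulates a task that returns immediately (as a server that
// fails to bind would) and is restarted until it has run n times (n >= 1).
func runAttempts(n int, period time.Duration) int {
	stop := make(chan struct{})
	attempts := 0
	until(func() {
		attempts++
		if attempts == n {
			close(stop) // stop retrying after the n-th attempt
		}
	}, period, stop)
	return attempts
}

func main() {
	fmt.Println("attempts:", runAttempts(3, 10*time.Millisecond))
}
```

In the kubelet the stop channel is wait.NeverStop, so the loop runs for the lifetime of the process.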

3.3. RunKubelet

The assignments above produce a fully populated kubeDeps struct. Finally, RunKubelet is called to continue the kubelet startup flow.

if err := RunKubelet(s, kubeDeps, s.RunOnce); err != nil {
	return err
}

4. RunKubelet

// RunKubelet is responsible for setting up and running a kubelet.  It is used in three different applications:
//   1 Integration tests
//   2 Kubelet binary
//   3 Standalone 'kubernetes' binary
// Eventually, #2 will be replaced with instances of #3
func RunKubelet(kubeServer *options.KubeletServer, kubeDeps *kubelet.Dependencies, runOnce bool) error {
	...
	k, err := CreateAndInitKubelet(&kubeServer.KubeletConfiguration,
		...
		kubeServer.NodeStatusMaxImages)
	if err != nil {
		return fmt.Errorf("failed to create kubelet: %v", err)
	}

	// NewMainKubelet should have set up a pod source config if one didn't exist
	// when the builder was run. This is just a precaution.
	if kubeDeps.PodConfig == nil {
		return fmt.Errorf("failed to create kubelet, pod source config was nil")
	}
	podCfg := kubeDeps.PodConfig

	rlimit.RlimitNumFiles(uint64(kubeServer.MaxOpenFiles))

	// process pods and exit.
	if runOnce {
		if _, err := k.RunOnce(podCfg.Updates()); err != nil {
			return fmt.Errorf("runonce failed: %v", err)
		}
		glog.Infof("Started kubelet as runonce")
	} else {
		startKubelet(k, podCfg, &kubeServer.KubeletConfiguration, kubeDeps, kubeServer.EnableServer)
		glog.Infof("Started kubelet")
	}
	return nil
}  

The core of RunKubelet is the calls to CreateAndInitKubelet and startKubelet. Both functions are analyzed below.

4.1. CreateAndInitKubelet

CreateAndInitKubelet is called with kubeDeps to initialize the Kubelet.

k, err := CreateAndInitKubelet(&kubeServer.KubeletConfiguration,
	kubeDeps,
	&kubeServer.ContainerRuntimeOptions,
	kubeServer.ContainerRuntime,
	kubeServer.RuntimeCgroups,
	kubeServer.HostnameOverride,
	kubeServer.NodeIP,
	kubeServer.ProviderID,
	kubeServer.CloudProvider,
	kubeServer.CertDirectory,
	kubeServer.RootDirectory,
	kubeServer.RegisterNode,
	kubeServer.RegisterWithTaints,
	kubeServer.AllowedUnsafeSysctls,
	kubeServer.RemoteRuntimeEndpoint,
	kubeServer.RemoteImageEndpoint,
	kubeServer.ExperimentalMounterPath,
	kubeServer.ExperimentalKernelMemcgNotification,
	kubeServer.ExperimentalCheckNodeCapabilitiesBeforeMount,
	kubeServer.ExperimentalNodeAllocatableIgnoreEvictionThreshold,
	kubeServer.MinimumGCAge,
	kubeServer.MaxPerPodContainerCount,
	kubeServer.MaxContainerCount,
	kubeServer.MasterServiceNamespace,
	kubeServer.RegisterSchedulable,
	kubeServer.NonMasqueradeCIDR,
	kubeServer.KeepTerminatedPodVolumes,
	kubeServer.NodeLabels,
	kubeServer.SeccompProfileRoot,
	kubeServer.BootstrapCheckpointPath,
	kubeServer.NodeStatusMaxImages)
if err != nil {
	return fmt.Errorf("failed to create kubelet: %v", err)
}

4.1.1. NewMainKubelet

The core function called by CreateAndInitKubelet is NewMainKubelet, which instantiates a kubelet object. Its code lives in kubernetes/pkg/kubelet; see kubernetes/pkg/kubelet/kubelet.go#L325.

func CreateAndInitKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration,
	...
	nodeStatusMaxImages int32) (k kubelet.Bootstrap, err error) {
	// TODO: block until all sources have delivered at least one update to the channel, or break the sync loop
	// up into "per source" synchronizations

	k, err = kubelet.NewMainKubelet(kubeCfg,
		kubeDeps,
		crOptions,
		containerRuntime,
		runtimeCgroups,
		hostnameOverride,
		nodeIP,
		providerID,
		cloudProvider,
		certDirectory,
		rootDirectory,
		registerNode,
		registerWithTaints,
		allowedUnsafeSysctls,
		remoteRuntimeEndpoint,
		remoteImageEndpoint,
		experimentalMounterPath,
		experimentalKernelMemcgNotification,
		experimentalCheckNodeCapabilitiesBeforeMount,
		experimentalNodeAllocatableIgnoreEvictionThreshold,
		minimumGCAge,
		maxPerPodContainerCount,
		maxContainerCount,
		masterServiceNamespace,
		registerSchedulable,
		nonMasqueradeCIDR,
		keepTerminatedPodVolumes,
		nodeLabels,
		seccompProfileRoot,
		bootstrapCheckpointPath,
		nodeStatusMaxImages)
	if err != nil {
		return nil, err
	}

	k.BirthCry()

	k.StartGarbageCollection()

	return k, nil
}

4.1.2. PodConfig

if kubeDeps.PodConfig == nil {
	var err error
	kubeDeps.PodConfig, err = makePodSourceConfig(kubeCfg, kubeDeps, nodeName, bootstrapCheckpointPath)
	if err != nil {
		return nil, err
	}
}

Call chain: NewMainKubelet-->PodConfig-->NewPodConfig-->kubetypes.PodUpdate. This creates a channel of PodUpdate values that carries pod changes; the channel is the key argument to k.Run(podCfg.Updates()).

4.2. startKubelet

// process pods and exit.
if runOnce {
	if _, err := k.RunOnce(podCfg.Updates()); err != nil {
		return fmt.Errorf("runonce failed: %v", err)
	}
	glog.Infof("Started kubelet as runonce")
} else {
	startKubelet(k, podCfg, &kubeServer.KubeletConfiguration, kubeDeps, kubeServer.EnableServer)
	glog.Infof("Started kubelet")
}

If the run-once option is set, k.RunOnce is executed; otherwise the core function startKubelet is executed. Its implementation is as follows:

func startKubelet(k kubelet.Bootstrap, podCfg *config.PodConfig, kubeCfg *kubeletconfiginternal.KubeletConfiguration, kubeDeps *kubelet.Dependencies, enableServer bool) {
	// start the kubelet
	go wait.Until(func() {
		k.Run(podCfg.Updates())
	}, 0, wait.NeverStop)

	// start the kubelet server
	if enableServer {
		go k.ListenAndServe(net.ParseIP(kubeCfg.Address), uint(kubeCfg.Port), kubeDeps.TLSOptions, kubeDeps.Auth, kubeCfg.EnableDebuggingHandlers, kubeCfg.EnableContentionProfiling)

	}
	if kubeCfg.ReadOnlyPort > 0 {
		go k.ListenAndServeReadOnly(net.ParseIP(kubeCfg.Address), uint(kubeCfg.ReadOnlyPort))
	}
}

4.2.1. k.Run

// start the kubelet
go wait.Until(func() {
	k.Run(podCfg.Updates())
}, 0, wait.NeverStop)

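k.Run is started in a long-running goroutine and never exits. From here the kubelet's logic moves into kubernetes/pkg/kubelet/kubelet.go; the logic in kubernetes/pkg/kubelet is analyzed in later articles.

5. Summary

  1. kubelet uses the Cobra command-line framework and the pflag flag-parsing library, which gives it the same code style as apiserver, scheduler, and controller-manager.

  2. kubernetes/cmd/kubelet mainly defines and parses the runtime flags and initializes and constructs the dependencies (mostly in the kubeDeps struct). It does not contain the detailed kubelet runtime logic, which lives in the kubernetes/pkg/kubelet module.

  3. The call flow of the cmd part is: Main-->NewKubeletCommand-->Run(kubeletServer, kubeletDeps, stopCh)-->run(s *options.KubeletServer, kubeDeps ..., stopCh ...)--> RunKubelet(s, kubeDeps, s.RunOnce)-->startKubelet-->k.Run(podCfg.Updates())-->pkg/kubelet

    In parallel: RunKubelet(s, kubeDeps, s.RunOnce)-->CreateAndInitKubelet-->kubelet.NewMainKubelet-->pkg/kubelet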

11.5.2 - kubelet source code analysis (2): NewMainKubelet

The following analysis is based on kubernetes v1.12.0.

This article covers the code under https://github.com/kubernetes/kubernetes/tree/v1.12.0/pkg/kubelet, focusing on the NewMainKubelet function.

1. NewMainKubelet

NewMainKubelet initializes and constructs a kubelet struct. For the definition of the kubelet struct, see: https://github.com/kubernetes/kubernetes/blob/v1.12.0/pkg/kubelet/kubelet.go#L888

// NewMainKubelet instantiates a new Kubelet object along with all the required internal modules.
// No initialization of Kubelet and its modules should happen here.
func NewMainKubelet(kubeCfg *kubeletconfiginternal.KubeletConfiguration,
	kubeDeps *Dependencies,
	crOptions *config.ContainerRuntimeOptions,
	containerRuntime string,
	runtimeCgroups string,
	hostnameOverride string,
	nodeIP string,
	providerID string,
	cloudProvider string,
	certDirectory string,
	rootDirectory string,
	registerNode bool,
	registerWithTaints []api.Taint,
	allowedUnsafeSysctls []string,
	remoteRuntimeEndpoint string,
	remoteImageEndpoint string,
	experimentalMounterPath string,
	experimentalKernelMemcgNotification bool,
	experimentalCheckNodeCapabilitiesBeforeMount bool,
	experimentalNodeAllocatableIgnoreEvictionThreshold bool,
	minimumGCAge metav1.Duration,
	maxPerPodContainerCount int32,
	maxContainerCount int32,
	masterServiceNamespace string,
	registerSchedulable bool,
	nonMasqueradeCIDR string,
	keepTerminatedPodVolumes bool,
	nodeLabels map[string]string,
	seccompProfileRoot string,
	bootstrapCheckpointPath string,
	nodeStatusMaxImages int32) (*Kubelet, error) {
    ...
}    

1.1. PodConfig

makePodSourceConfig builds the pod config.

if kubeDeps.PodConfig == nil {
	var err error
	kubeDeps.PodConfig, err = makePodSourceConfig(kubeCfg, kubeDeps, nodeName, bootstrapCheckpointPath)
	if err != nil {
		return nil, err
	}
}

1.1.1. makePodSourceConfig

// makePodSourceConfig creates a config.PodConfig from the given
// KubeletConfiguration or returns an error.
func makePodSourceConfig(kubeCfg *kubeletconfiginternal.KubeletConfiguration, kubeDeps *Dependencies, nodeName types.NodeName, bootstrapCheckpointPath string) (*config.PodConfig, error) {
	...
	// source of all configuration
	cfg := config.NewPodConfig(config.PodConfigNotificationIncremental, kubeDeps.Recorder)
	
	// define file config source
	if kubeCfg.StaticPodPath != "" {
		glog.Infof("Adding pod path: %v", kubeCfg.StaticPodPath)
		config.NewSourceFile(kubeCfg.StaticPodPath, nodeName, kubeCfg.FileCheckFrequency.Duration, cfg.Channel(kubetypes.FileSource))
	}

	// define url config source
	if kubeCfg.StaticPodURL != "" {
		glog.Infof("Adding pod url %q with HTTP header %v", kubeCfg.StaticPodURL, manifestURLHeader)
		config.NewSourceURL(kubeCfg.StaticPodURL, manifestURLHeader, nodeName, kubeCfg.HTTPCheckFrequency.Duration, cfg.Channel(kubetypes.HTTPSource))
	}
    
	// Restore from the checkpoint path
	// NOTE: This MUST happen before creating the apiserver source
	// below, or the checkpoint would override the source of truth.
	...
	if kubeDeps.KubeClient != nil {
		glog.Infof("Watching apiserver")
		if updatechannel == nil {
			updatechannel = cfg.Channel(kubetypes.ApiserverSource)
		}
		config.NewSourceApiserver(kubeDeps.KubeClient, nodeName, updatechannel)
	}
	return cfg, nil
}

1.1.2. NewPodConfig

// NewPodConfig creates an object that can merge many configuration sources into a stream
// of normalized updates to a pod configuration.
func NewPodConfig(mode PodConfigNotificationMode, recorder record.EventRecorder) *PodConfig {
	updates := make(chan kubetypes.PodUpdate, 50)
	storage := newPodStorage(updates, mode, recorder)
	podConfig := &PodConfig{
		pods:    storage,
		mux:     config.NewMux(storage),
		updates: updates,
		sources: sets.String{},
	}
	return podConfig
}
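PodConfig merges several sources (file, URL, apiserver) into one stream of updates. A minimal sketch of that fan-in idea follows. The types are simplified and hypothetical: pods are plain strings, and the merge and normalization done by the real pod storage is left out.

```go
package main

import (
	"fmt"
	"sync"
)

// PodUpdate is a simplified stand-in for kubetypes.PodUpdate.
type PodUpdate struct {
	Source string
	Pods   []string
}

// PodConfig fans several per-source channels into one updates channel.
type PodConfig struct {
	mu      sync.Mutex
	sources map[string]chan []string
	updates chan PodUpdate
}

func NewPodConfig() *PodConfig {
	return &PodConfig{
		sources: map[string]chan []string{},
		updates: make(chan PodUpdate, 50), // buffered, as in the original
	}
}

// Channel returns the input channel for a named source, creating it and
// its forwarding goroutine on first use.
func (c *PodConfig) Channel(source string) chan<- []string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if ch, ok := c.sources[source]; ok {
		return ch
	}
	ch := make(chan []string)
	c.sources[source] = ch
	go func() {
		for pods := range ch {
			c.updates <- PodUpdate{Source: source, Pods: pods}
		}
	}()
	return ch
}

// Updates returns the merged, read-only stream consumed by the sync loop.
func (c *PodConfig) Updates() <-chan PodUpdate { return c.updates }

// demo pushes one update per source and returns what arrives on the
// merged channel, keyed by source.
func demo() map[string][]string {
	cfg := NewPodConfig()
	file := cfg.Channel("file")
	api := cfg.Channel("api")
	file <- []string{"static-web"}
	api <- []string{"nginx", "redis"}

	got := map[string][]string{}
	for i := 0; i < 2; i++ {
		u := <-cfg.Updates()
		got[u.Source] = u.Pods
	}
	return got
}

func main() {
	fmt.Println(demo())
}
```

Each source writes to its own channel, while the consumer reads a single channel and can tell the sources apart by the Source field.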

1.1.3. NewSourceApiserver

// NewSourceApiserver creates a config source that watches and pulls from the apiserver.
func NewSourceApiserver(c clientset.Interface, nodeName types.NodeName, updates chan<- interface{}) {
	lw := cache.NewListWatchFromClient(c.CoreV1().RESTClient(), "pods", metav1.NamespaceAll, fields.OneTermEqualSelector(api.PodHostField, string(nodeName)))
	newSourceApiserverFromLW(lw, updates)
}

1.2. Lister

serviceLister and nodeLister use the List-Watch mechanism to track changes to the service and node lists, respectively.

1.2.1. serviceLister

serviceIndexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{cache.NamespaceIndex: cache.MetaNamespaceIndexFunc})
if kubeDeps.KubeClient != nil {
	serviceLW := cache.NewListWatchFromClient(kubeDeps.KubeClient.CoreV1().RESTClient(), "services", metav1.NamespaceAll, fields.Everything())
	r := cache.NewReflector(serviceLW, &v1.Service{}, serviceIndexer, 0)
	go r.Run(wait.NeverStop)
}
serviceLister := corelisters.NewServiceLister(serviceIndexer)

1.2.2. nodeLister

nodeIndexer := cache.NewIndexer(cache.MetaNamespaceKeyFunc, cache.Indexers{})
if kubeDeps.KubeClient != nil {
	fieldSelector := fields.Set{api.ObjectNameField: string(nodeName)}.AsSelector()
	nodeLW := cache.NewListWatchFromClient(kubeDeps.KubeClient.CoreV1().RESTClient(), "nodes", metav1.NamespaceAll, fieldSelector)
	r := cache.NewReflector(nodeLW, &v1.Node{}, nodeIndexer, 0)
	go r.Run(wait.NeverStop)
}
nodeInfo := &predicates.CachedNodeInfo{NodeLister: corelisters.NewNodeLister(nodeIndexer)}
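Both listers follow the same pattern: a reflector fills a local store from an initial list and then keeps it current from watch events, while the lister only reads the store. The sketch below shows that pattern with hypothetical types (`Store`, `Event`, `runReflector`); it is not the client-go cache package.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

type EventType string

const (
	Added   EventType = "ADDED"
	Deleted EventType = "DELETED"
)

// Event is a simplified watch event that carries only an object name.
type Event struct {
	Type EventType
	Name string
}

// Store is a tiny thread-safe cache standing in for the indexer.
type Store struct {
	mu    sync.RWMutex
	items map[string]struct{}
}

func NewStore() *Store { return &Store{items: map[string]struct{}{}} }

func (s *Store) apply(e Event) {
	s.mu.Lock()
	defer s.mu.Unlock()
	switch e.Type {
	case Added:
		s.items[e.Name] = struct{}{}
	case Deleted:
		delete(s.items, e.Name)
	}
}

// List is the "lister" side: it reads the local cache, never the server.
func (s *Store) List() []string {
	s.mu.RLock()
	defer s.mu.RUnlock()
	names := make([]string, 0, len(s.items))
	for n := range s.items {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// runReflector seeds the store from an initial list, then applies watch
// events until the watch channel is closed.
func runReflector(store *Store, initial []string, watch <-chan Event) {
	for _, n := range initial {
		store.apply(Event{Type: Added, Name: n})
	}
	for e := range watch {
		store.apply(e)
	}
}

func demo() []string {
	store := NewStore()
	watch := make(chan Event)
	done := make(chan struct{})
	go func() {
		runReflector(store, []string{"kubernetes"}, watch)
		close(done)
	}()
	watch <- Event{Type: Added, Name: "dns"}
	watch <- Event{Type: Deleted, Name: "kubernetes"}
	close(watch)
	<-done
	return store.List()
}

func main() {
	fmt.Println(demo())
}
```

Because reads go to the local store, listing services or looking up the node does not require a round trip to the apiserver.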

1.3. Managers

1.3.1. containerRefManager

containerRefManager := kubecontainer.NewRefManager()

1.3.2. oomWatcher

oomWatcher := NewOOMWatcher(kubeDeps.CAdvisorInterface, kubeDeps.Recorder)

1.3.3. dnsConfigurer

clusterDNS := make([]net.IP, 0, len(kubeCfg.ClusterDNS))
for _, ipEntry := range kubeCfg.ClusterDNS {
	ip := net.ParseIP(ipEntry)
	if ip == nil {
		glog.Warningf("Invalid clusterDNS ip '%q'", ipEntry)
	} else {
		clusterDNS = append(clusterDNS, ip)
	}
}
...

dns.NewConfigurer(kubeDeps.Recorder, nodeRef, parsedNodeIP, clusterDNS, kubeCfg.ClusterDomain, kubeCfg.ResolverConfig),

1.3.4. secretManager & configMapManager

var secretManager secret.Manager
var configMapManager configmap.Manager
switch kubeCfg.ConfigMapAndSecretChangeDetectionStrategy {
case kubeletconfiginternal.WatchChangeDetectionStrategy:
	secretManager = secret.NewWatchingSecretManager(kubeDeps.KubeClient)
	configMapManager = configmap.NewWatchingConfigMapManager(kubeDeps.KubeClient)
case kubeletconfiginternal.TTLCacheChangeDetectionStrategy:
	secretManager = secret.NewCachingSecretManager(
		kubeDeps.KubeClient, manager.GetObjectTTLFromNodeFunc(klet.GetNode))
	configMapManager = configmap.NewCachingConfigMapManager(
		kubeDeps.KubeClient, manager.GetObjectTTLFromNodeFunc(klet.GetNode))
case kubeletconfiginternal.GetChangeDetectionStrategy:
	secretManager = secret.NewSimpleSecretManager(kubeDeps.KubeClient)
	configMapManager = configmap.NewSimpleConfigMapManager(kubeDeps.KubeClient)
default:
	return nil, fmt.Errorf("unknown configmap and secret manager mode: %v", kubeCfg.ConfigMapAndSecretChangeDetectionStrategy)
}

klet.secretManager = secretManager
klet.configMapManager = configMapManager
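The switch above picks one implementation behind a common interface. The following sketch models only the Get and Cache branches to show how they differ in the number of backend reads; the watch-based variant is left out. The strategy strings, `fetchFunc`, and both manager types are assumptions made for this illustration, and the cache is not safe for concurrent use.

```go
package main

import (
	"fmt"
	"time"
)

// fetchFunc stands in for a read from the apiserver.
type fetchFunc func(name string) string

// Manager is a hypothetical common interface for both strategies.
type Manager interface {
	Get(name string) string
}

// simpleManager fetches from the backend on every call.
type simpleManager struct{ fetch fetchFunc }

func (m *simpleManager) Get(name string) string { return m.fetch(name) }

type entry struct {
	value   string
	fetched time.Time
}

// cachingManager serves values from a cache until they are older than ttl.
type cachingManager struct {
	fetch fetchFunc
	ttl   time.Duration
	cache map[string]entry
}

func (m *cachingManager) Get(name string) string {
	if e, ok := m.cache[name]; ok && time.Since(e.fetched) < m.ttl {
		return e.value
	}
	v := m.fetch(name)
	m.cache[name] = entry{value: v, fetched: time.Now()}
	return v
}

// newManager selects an implementation by strategy name and rejects
// anything it does not know, like the default branch in the original.
func newManager(strategy string, fetch fetchFunc, ttl time.Duration) (Manager, error) {
	switch strategy {
	case "Get":
		return &simpleManager{fetch: fetch}, nil
	case "Cache":
		return &cachingManager{fetch: fetch, ttl: ttl, cache: map[string]entry{}}, nil
	default:
		return nil, fmt.Errorf("unknown configmap and secret manager mode: %v", strategy)
	}
}

// countFetches reports how many backend reads happen for a number of
// consecutive Get calls under the given strategy.
func countFetches(strategy string, reads int) (int, error) {
	fetches := 0
	backend := func(name string) string {
		fetches++
		return "value-of-" + name
	}
	m, err := newManager(strategy, backend, time.Hour)
	if err != nil {
		return 0, err
	}
	for i := 0; i < reads; i++ {
		m.Get("db-password")
	}
	return fetches, nil
}

func main() {
	for _, s := range []string{"Get", "Cache"} {
		n, _ := countFetches(s, 3)
		fmt.Printf("%s: %d backend reads for 3 lookups\n", s, n)
	}
}
```

The trade-off is freshness against load: fetching on every call always returns current data, while the TTL cache reduces apiserver traffic at the cost of serving values that may be up to one TTL old.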

1.3.5. livenessManager

klet.livenessManager = proberesults.NewManager()

1.3.6. podManager

// podManager is also responsible for keeping secretManager and configMapManager contents up-to-date.
klet.podManager = kubepod.NewBasicPodManager(kubepod.NewBasicMirrorClient(klet.kubeClient), secretManager, configMapManager, checkpointManager)

1.3.7. resourceAnalyzer

klet.resourceAnalyzer = serverstats.NewResourceAnalyzer(klet, kubeCfg.VolumeStatsAggPeriod.Duration)

1.3.8. containerGC

// setup containerGC
containerGC, err := kubecontainer.NewContainerGC(klet.containerRuntime, containerGCPolicy, klet.sourcesReady)
if err != nil {
	return nil, err
}
klet.containerGC = containerGC
klet.containerDeletor = newPodContainerDeletor(klet.containerRuntime, integer.IntMax(containerGCPolicy.MaxPerPodContainer, minDeadContainerInPod))

1.3.9. imageManager

// setup imageManager
imageManager, err := images.NewImageGCManager(klet.containerRuntime, klet.StatsProvider, kubeDeps.Recorder, nodeRef, imageGCPolicy, crOptions.PodSandboxImage)
if err != nil {
	return nil, fmt.Errorf("failed to initialize image manager: %v", err)
}
klet.imageManager = imageManager

1.3.10. statusManager

klet.statusManager = status.NewManager(klet.kubeClient, klet.podManager, klet)

1.3.11. probeManager

klet.probeManager = prober.NewManager(
	klet.statusManager,
	klet.livenessManager,
	klet.runner,
	containerRefManager,
	kubeDeps.Recorder)

1.3.12. tokenManager

tokenManager := token.NewManager(kubeDeps.KubeClient)

1.3.13. volumePluginMgr

klet.volumePluginMgr, err =
	NewInitializedVolumePluginMgr(klet, secretManager, configMapManager, tokenManager, kubeDeps.VolumePlugins, kubeDeps.DynamicPluginProber)
if err != nil {
	return nil, err
}
if klet.enablePluginsWatcher {
	klet.pluginWatcher = pluginwatcher.NewWatcher(klet.getPluginsDir())
}

1.3.14. volumeManager

// setup volumeManager
klet.volumeManager = volumemanager.NewVolumeManager(
	kubeCfg.EnableControllerAttachDetach,
	nodeName,
	klet.podManager,
	klet.statusManager,
	klet.kubeClient,
	klet.volumePluginMgr,
	klet.containerRuntime,
	kubeDeps.Mounter,
	klet.getPodsDir(),
	kubeDeps.Recorder,
	experimentalCheckNodeCapabilitiesBeforeMount,
	keepTerminatedPodVolumes)

1.3.15. evictionManager

// setup eviction manager
evictionManager, evictionAdmitHandler := eviction.NewManager(klet.resourceAnalyzer, evictionConfig, killPodNow(klet.podWorkers, kubeDeps.Recorder), klet.imageManager, klet.containerGC, kubeDeps.Recorder, nodeRef, klet.clock)

klet.evictionManager = evictionManager
klet.admitHandlers.AddPodAdmitHandler(evictionAdmitHandler)

1.4. containerRuntime

The only runtimes currently available to pods are docker and remote; rkt has been deprecated.

if containerRuntime == "rkt" {
	glog.Fatalln("rktnetes has been deprecated in favor of rktlet. Please see https://github.com/kubernetes-incubator/rktlet for more information.")
}

When the runtime is docker, the docker-specific setup is performed.

	switch containerRuntime {
	case kubetypes.DockerContainerRuntime:
		// Create and start the CRI shim running as a grpc server.
		...
		// The unix socket for kubelet <-> dockershim communication.
		...
		// Create dockerLegacyService when the logging driver is not supported.
		...
	case kubetypes.RemoteContainerRuntime:
		// No-op.
		break
	default:
		return nil, fmt.Errorf("unsupported CRI runtime: %q", containerRuntime)
	}

1.4.1. NewDockerService

// Create and start the CRI shim running as a grpc server.
streamingConfig := getStreamingConfig(kubeCfg, kubeDeps, crOptions)
ds, err := dockershim.NewDockerService(kubeDeps.DockerClientConfig, crOptions.PodSandboxImage, streamingConfig,
	&pluginSettings, runtimeCgroups, kubeCfg.CgroupDriver, crOptions.DockershimRootDirectory, !crOptions.RedirectContainerStreaming)
if err != nil {
	return nil, err
}
if crOptions.RedirectContainerStreaming {
	klet.criHandler = ds
}

1.4.2. NewDockerServer

// The unix socket for kubelet <-> dockershim communication.
glog.V(5).Infof("RemoteRuntimeEndpoint: %q, RemoteImageEndpoint: %q",
	remoteRuntimeEndpoint,
	remoteImageEndpoint)
glog.V(2).Infof("Starting the GRPC server for the docker CRI shim.")
server := dockerremote.NewDockerServer(remoteRuntimeEndpoint, ds)
if err := server.Start(); err != nil {
	return nil, err
}

1.4.3. DockerServer.Start

// Start starts the dockershim grpc server.
func (s *DockerServer) Start() error {
	// Start the internal service.
	if err := s.service.Start(); err != nil {
		glog.Errorf("Unable to start docker service")
		return err
	}

	glog.V(2).Infof("Start dockershim grpc server")
	l, err := util.CreateListener(s.endpoint)
	if err != nil {
		return fmt.Errorf("failed to listen on %q: %v", s.endpoint, err)
	}
	// Create the grpc server and register runtime and image services.
	s.server = grpc.NewServer(
		grpc.MaxRecvMsgSize(maxMsgSize),
		grpc.MaxSendMsgSize(maxMsgSize),
	)
	runtimeapi.RegisterRuntimeServiceServer(s.server, s.service)
	runtimeapi.RegisterImageServiceServer(s.server, s.service)
	go func() {
		if err := s.server.Serve(l); err != nil {
			glog.Fatalf("Failed to serve connections: %v", err)
		}
	}()
	return nil
}

1.5. podWorker

Construct podWorkers and workQueue.

klet.workQueue = queue.NewBasicWorkQueue(klet.clock)
klet.podWorkers = newPodWorkers(klet.syncPod, kubeDeps.Recorder, klet.workQueue, klet.resyncInterval, backOffPeriod, klet.podCache)

1.5.1. The PodWorkers interface

// PodWorkers is an abstract interface for testability.
type PodWorkers interface {
	UpdatePod(options *UpdatePodOptions)
	ForgetNonExistingPodWorkers(desiredPods map[types.UID]empty)
	ForgetWorker(uid types.UID)
}

podWorker handles and syncs the events for each pod. The interface has three methods: UpdatePod, ForgetNonExistingPodWorkers, and ForgetWorker.
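One way to picture the three methods is a map from pod UID to a dedicated worker goroutine. The sketch below is an assumption-laden simplification: UIDs and updates are strings, UpdatePod blocks until the worker accepts the update, and there is no backoff or work queue, unlike the real pod_workers.go.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// podWorkers keeps one goroutine per pod UID so that updates for the
// same pod are handled in order while different pods run in parallel.
type podWorkers struct {
	mu      sync.Mutex
	workers map[string]chan string // UID -> update channel
	syncPod func(uid, update string)
	wg      sync.WaitGroup
}

func newPodWorkers(syncPod func(uid, update string)) *podWorkers {
	return &podWorkers{workers: map[string]chan string{}, syncPod: syncPod}
}

// UpdatePod delivers an update, starting the pod's worker on first use.
func (p *podWorkers) UpdatePod(uid, update string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	ch, ok := p.workers[uid]
	if !ok {
		ch = make(chan string)
		p.workers[uid] = ch
		p.wg.Add(1)
		go func() {
			defer p.wg.Done()
			for u := range ch {
				p.syncPod(uid, u)
			}
		}()
	}
	ch <- update // sent under the lock so it cannot race with a close
}

// ForgetWorker stops and removes the worker for one pod.
func (p *podWorkers) ForgetWorker(uid string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if ch, ok := p.workers[uid]; ok {
		close(ch)
		delete(p.workers, uid)
	}
}

// ForgetNonExistingPodWorkers removes every worker not in desired.
func (p *podWorkers) ForgetNonExistingPodWorkers(desired map[string]struct{}) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for uid, ch := range p.workers {
		if _, ok := desired[uid]; !ok {
			close(ch)
			delete(p.workers, uid)
		}
	}
}

func (p *podWorkers) count() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return len(p.workers)
}

func demo() (synced []string, remaining int) {
	var mu sync.Mutex
	w := newPodWorkers(func(uid, update string) {
		mu.Lock()
		synced = append(synced, uid+":"+update)
		mu.Unlock()
	})
	w.UpdatePod("a", "create")
	w.UpdatePod("a", "update")
	w.UpdatePod("b", "create")
	w.ForgetNonExistingPodWorkers(map[string]struct{}{"a": {}})
	remaining = w.count()
	w.ForgetWorker("a")
	w.wg.Wait() // wait for all workers to finish their last sync
	sort.Strings(synced)
	return synced, remaining
}

func main() {
	synced, remaining := demo()
	fmt.Println(synced, remaining)
}
```

The sync callback must not call back into podWorkers in this sketch, because UpdatePod holds the lock while it hands over the update.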

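2. Summary

  1. NewMainKubelet constructs the kubelet struct. Besides the required configuration and clients (for example kubeClient and csiClient), the struct mainly holds a set of managers, each responsible for a different task.

  2. The core managers are:

    • oomWatcher: monitors whether pods run out of memory (OOM).
    • podManager: manages the pod lifecycle, including creating, reading, updating, and deleting pods.
    • containerGC: garbage-collects dead containers.
    • imageManager: garbage-collects container images.
    • statusManager: syncs pod status with the apiserver and also caches status.
    • volumeManager: attaches, detaches, mounts, and unmounts pod volumes.
    • evictionManager: keeps the node stable by evicting pods when necessary (for example, when resources run low).

  3. NewMainKubelet also sets up serviceLister and nodeLister to track changes to the service and node lists.

  4. The containerRuntime used by kubelet is currently mainly docker; rkt is deprecated. NewMainKubelet starts the dockershim gRPC server to perform docker operations.

  5. podWorker is built to handle the update logic for pods.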

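11.5.3 - kubelet source code analysis (3): startKubelet

The following analysis is based on kubernetes v1.12.0.

This article analyzes startKubelet, mainly the kubelet.Run part, which initializes and runs a number of managers. The execution logic of the individual managers and the pod lifecycle management logic are left to later articles.

Later articles will analyze the code under pkg/kubelet by category.

Directory layout of pkg/kubelet: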

kubelet
├── apis  # definitions of related interfaces
├── cadvisor # cadvisor
├── cm # ContainerManager, cpu manager, cgroup manager
├── config
├── configmap # configmap manager
├── container  # Runtime, ImageService
├── dockershim  # docker-related calls
├── eviction # eviction manager
├── images  # image manager
├── kubeletconfig
├── kuberuntime # core: kubeGenericRuntimeManager, container runtime operations
├── lifecycle
├── mountpod
├── network  # pod dns
├── nodelease
├── nodestatus  # MachineInfo, node-related information
├── pleg  # PodLifecycleEventGenerator
├── pod  # core: pod manager, mirror pod
├── preemption
├── qos  # resource quality of service; little content for now
├── remote # RemoteRuntimeService
├── server
├── stats # StatsProvider
├── status # status manager
├── types  # PodUpdate, PodOperation
├── volumemanager # VolumeManager
├── kubelet.go  # core: SyncHandler, most kubelet operations
├── kubelet_getters.go # getters, e.g. directory lookups: getRootDir, getPodsDir, getPluginsDir
├── kubelet_network.go
├── kubelet_network_linux.go
├── kubelet_node_status.go # registerWithAPIServer, initialNode, syncNodeStatus
├── kubelet_pods.go # core: pod create/read/update/delete operations, podKiller
├── kubelet_resources.go
├── kubelet_volumes.go # ListVolumesForPod, cleanupOrphanedPodDirs
├── oom_watcher.go  # OOMWatcher
├── pod_container_deletor.go
├── pod_workers.go # core: PodWorkers, UpdatePodOptions, syncPodOptions, managePodLoop
├── runonce.go  # RunOnce
├── runtime.go
...

1. startKubelet

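startKubelet is defined in cmd/kubelet/app/server.go. It starts and runs a kubelet; the logic that actually runs the kubelet is in pkg/kubelet/kubelet.go.

It does two things:

  1. Runs a kubelet, which executes the logic of the kubelet's managers.
  2. Runs the kubelet server to start the listening service.

This code is in cmd/kubelet/app/server.go: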

func startKubelet(k kubelet.Bootstrap, podCfg *config.PodConfig, kubeCfg *kubeletconfiginternal.KubeletConfiguration, kubeDeps *kubelet.Dependencies, enableServer bool) {
	// start the kubelet
	go wait.Until(func() {
		k.Run(podCfg.Updates())
	}, 0, wait.NeverStop)

	// start the kubelet server
	if enableServer {
		go k.ListenAndServe(net.ParseIP(kubeCfg.Address), uint(kubeCfg.Port), kubeDeps.TLSOptions, kubeDeps.Auth, kubeCfg.EnableDebuggingHandlers, kubeCfg.EnableContentionProfiling)

	}
	if kubeCfg.ReadOnlyPort > 0 {
		go k.ListenAndServeReadOnly(net.ParseIP(kubeCfg.Address), uint(kubeCfg.ReadOnlyPort))
	}
}

2. Kubelet.Run

Kubelet.Run starts the managers constructed by NewMainKubelet so that each performs its function. Most managers run as long-lived goroutines.

The complete code of Kubelet.Run is as follows.

This code is in pkg/kubelet/kubelet.go:

// Run starts the kubelet reacting to config updates
func (kl *Kubelet) Run(updates <-chan kubetypes.PodUpdate) {
	if kl.logServer == nil {
		kl.logServer = http.StripPrefix("/logs/", http.FileServer(http.Dir("/var/log/")))
	}
	if kl.kubeClient == nil {
		glog.Warning("No api server defined - no node status update will be sent.")
	}

	// Start the cloud provider sync manager
	if kl.cloudResourceSyncManager != nil {
		go kl.cloudResourceSyncManager.Run(wait.NeverStop)
	}

	if err := kl.initializeModules(); err != nil {
		kl.recorder.Eventf(kl.nodeRef, v1.EventTypeWarning, events.KubeletSetupFailed, err.Error())
		glog.Fatal(err)
	}

	// Start volume manager
	go kl.volumeManager.Run(kl.sourcesReady, wait.NeverStop)

	if kl.kubeClient != nil {
		// Start syncing node status immediately, this may set up things the runtime needs to run.
		go wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)
		go kl.fastStatusUpdateOnce()

		// start syncing lease
		if utilfeature.DefaultFeatureGate.Enabled(features.NodeLease) {
			go kl.nodeLeaseController.Run(wait.NeverStop)
		}
	}
	go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)

	// Start loop to sync iptables util rules
	if kl.makeIPTablesUtilChains {
		go wait.Until(kl.syncNetworkUtil, 1*time.Minute, wait.NeverStop)
	}

	// Start a goroutine responsible for killing pods (that are not properly
	// handled by pod workers).
	go wait.Until(kl.podKiller, 1*time.Second, wait.NeverStop)

	// Start component sync loops.
	kl.statusManager.Start()
	kl.probeManager.Start()

	// Start syncing RuntimeClasses if enabled.
	if kl.runtimeClassManager != nil {
		go kl.runtimeClassManager.Run(wait.NeverStop)
	}

	// Start the pod lifecycle event generator.
	kl.pleg.Start()
	kl.syncLoop(updates, kl)
}

The following sections analyze Kubelet.Run piece by piece.

3. initializeModules

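initializeModules covers imageManager, serverCertificateManager, oomWatcher, and resourceAnalyzer.

The main flow is:

  1. Create the filesystem directories: the kubelet root directory, the pods directory, the plugins directory, and the container logs directory.
  2. Start imageManager, serverCertificateManager, oomWatcher, and resourceAnalyzer.

The managers are:

  • imageManager: handles image garbage collection.
  • serverCertificateManager: handles certificates.
  • oomWatcher: watches memory usage for out-of-memory (OOM) events.
  • resourceAnalyzer: monitors resource usage.

The complete code is as follows:

This code is located in pkg/kubelet/kubelet.go.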

// initializeModules will initialize internal modules that do not require the container runtime to be up.
// Note that the modules here must not depend on modules that are not initialized here.
func (kl *Kubelet) initializeModules() error {
	// Prometheus metrics.
	metrics.Register(kl.runtimeCache, collectors.NewVolumeStatsCollector(kl))

	// Setup filesystem directories.
	if err := kl.setupDataDirs(); err != nil {
		return err
	}

	// If the container logs directory does not exist, create it.
	if _, err := os.Stat(ContainerLogsDir); err != nil {
		if err := kl.os.MkdirAll(ContainerLogsDir, 0755); err != nil {
			glog.Errorf("Failed to create directory %q: %v", ContainerLogsDir, err)
		}
	}

	// Start the image manager.
	kl.imageManager.Start()

	// Start the certificate manager if it was enabled.
	if kl.serverCertificateManager != nil {
		kl.serverCertificateManager.Start()
	}

	// Start out of memory watcher.
	if err := kl.oomWatcher.Start(kl.nodeRef); err != nil {
		return fmt.Errorf("Failed to start OOM watcher %v", err)
	}

	// Start resource analyzer
	kl.resourceAnalyzer.Start()

	return nil
}

3.1. setupDataDirs

initializeModules first creates the required directories.

The directories are:

  • ContainerLogsDir: /var/log/containers.
  • rootDirectory: passed in as a parameter, typically /var/lib/kubelet.
  • PodsDir: {rootDirectory}/pods.
  • PluginsDir: {rootDirectory}/plugins.

The part of initializeModules that sets up the directories is:

// Setup filesystem directories.
if err := kl.setupDataDirs(); err != nil {
	return err
}

// If the container logs directory does not exist, create it.
if _, err := os.Stat(ContainerLogsDir); err != nil {
	if err := kl.os.MkdirAll(ContainerLogsDir, 0755); err != nil {
		glog.Errorf("Failed to create directory %q: %v", ContainerLogsDir, err)
	}
}

The code of setupDataDirs is as follows:

// setupDataDirs creates:
// 1.  the root directory
// 2.  the pods directory
// 3.  the plugins directory
func (kl *Kubelet) setupDataDirs() error {
	kl.rootDirectory = path.Clean(kl.rootDirectory)
	if err := os.MkdirAll(kl.getRootDir(), 0750); err != nil {
		return fmt.Errorf("error creating root directory: %v", err)
	}
	if err := kl.mounter.MakeRShared(kl.getRootDir()); err != nil {
		return fmt.Errorf("error configuring root directory: %v", err)
	}
	if err := os.MkdirAll(kl.getPodsDir(), 0750); err != nil {
		return fmt.Errorf("error creating pods directory: %v", err)
	}
	if err := os.MkdirAll(kl.getPluginsDir(), 0750); err != nil {
		return fmt.Errorf("error creating plugins directory: %v", err)
	}
	return nil
}

3.2. manager

The managers started in initializeModules are:

// Start the image manager.
kl.imageManager.Start()

// Start the certificate manager if it was enabled.
if kl.serverCertificateManager != nil {
	kl.serverCertificateManager.Start()
}

// Start out of memory watcher.
if err := kl.oomWatcher.Start(kl.nodeRef); err != nil {
	return fmt.Errorf("Failed to start OOM watcher %v", err)
}

// Start resource analyzer
kl.resourceAnalyzer.Start()

4. Running the managers

4.1. volumeManager

volumeManager runs a set of asynchronous loops that decide which volumes need to be attached, detached, mounted, or unmounted, based on the pods scheduled to this node.

// Start volume manager
go kl.volumeManager.Run(kl.sourcesReady, wait.NeverStop)

volumeManager.Run is implemented as follows:

func (vm *volumeManager) Run(sourcesReady config.SourcesReady, stopCh <-chan struct{}) {
	defer runtime.HandleCrash()

	go vm.desiredStateOfWorldPopulator.Run(sourcesReady, stopCh)
	glog.V(2).Infof("The desired_state_of_world populator starts")

	glog.Infof("Starting Kubelet Volume Manager")
	go vm.reconciler.Run(stopCh)

	metrics.Register(vm.actualStateOfWorld, vm.desiredStateOfWorld, vm.volumePluginMgr)

	<-stopCh
	glog.Infof("Shutting down Kubelet Volume Manager")
}

4.2. syncNodeStatus

syncNodeStatus runs periodically in a goroutine. It syncs the node status to the master and registers the kubelet when necessary.

if kl.kubeClient != nil {
	// Start syncing node status immediately, this may set up things the runtime needs to run.
	go wait.Until(kl.syncNodeStatus, kl.nodeStatusUpdateFrequency, wait.NeverStop)
	go kl.fastStatusUpdateOnce()

	// start syncing lease
	if utilfeature.DefaultFeatureGate.Enabled(features.NodeLease) {
		go kl.nodeLeaseController.Run(wait.NeverStop)
	}
}

4.3. updateRuntimeUp

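updateRuntimeUp calls the container runtime status callback and initializes the runtime-dependent modules when the container runtime first comes up; it returns an error if the status check fails. If the status check passes, it updates the container runtime uptime in the kubelet runtimeState.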

go wait.Until(kl.updateRuntimeUp, 5*time.Second, wait.NeverStop)

4.4. syncNetworkUtil

This loop syncs the iptables rules, although the current code does not actually perform any operation.

// Start loop to sync iptables util rules
if kl.makeIPTablesUtilChains {
	go wait.Until(kl.syncNetworkUtil, 1*time.Minute, wait.NeverStop)
}

4.5. podKiller

When a pod is not handled properly by the pod workers, a dedicated goroutine is responsible for killing it.

// Start a goroutine responsible for killing pods (that are not properly
// handled by pod workers).
go wait.Until(kl.podKiller, 1*time.Second, wait.NeverStop)

The podKiller code is as follows:

This code is located in pkg/kubelet/kubelet_pods.go.

// podKiller launches a goroutine to kill a pod received from the channel if
// another goroutine isn't already in action.
func (kl *Kubelet) podKiller() {
	killing := sets.NewString()
	// guard for the killing set
	lock := sync.Mutex{}
	for podPair := range kl.podKillingCh {
		runningPod := podPair.RunningPod
		apiPod := podPair.APIPod

		lock.Lock()
		exists := killing.Has(string(runningPod.ID))
		if !exists {
			killing.Insert(string(runningPod.ID))
		}
		lock.Unlock()

		if !exists {
			go func(apiPod *v1.Pod, runningPod *kubecontainer.Pod) {
				glog.V(2).Infof("Killing unwanted pod %q", runningPod.Name)
				err := kl.killPod(apiPod, runningPod, nil, nil)
				if err != nil {
					glog.Errorf("Failed killing the pod %q: %v", runningPod.Name, err)
				}
				lock.Lock()
				killing.Delete(string(runningPod.ID))
				lock.Unlock()
			}(apiPod, runningPod)
		}
	}
}
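
podKiller uses a mutex-guarded set so that at most one kill is in progress per pod ID. A minimal sketch of that deduplication pattern follows. The inFlight type and its methods are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// inFlight tracks IDs that are currently being processed. It plays the role
// of the "killing" set and its lock in podKiller.
type inFlight struct {
	mu  sync.Mutex
	ids map[string]struct{}
}

func newInFlight() *inFlight {
	return &inFlight{ids: map[string]struct{}{}}
}

// tryStart marks id as in progress and reports whether the caller should do
// the work. It returns false if the same id is already being processed.
func (f *inFlight) tryStart(id string) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if _, exists := f.ids[id]; exists {
		return false
	}
	f.ids[id] = struct{}{}
	return true
}

// finish removes id from the set once the work is done.
func (f *inFlight) finish(id string) {
	f.mu.Lock()
	defer f.mu.Unlock()
	delete(f.ids, id)
}

func main() {
	f := newInFlight()
	fmt.Println(f.tryStart("pod-a")) // first request starts the work
	fmt.Println(f.tryStart("pod-a")) // duplicate while in flight is skipped
	f.finish("pod-a")
	fmt.Println(f.tryStart("pod-a")) // allowed again after completion
}
```

In podKiller the work between tryStart and finish runs in its own goroutine, so a slow kill does not block reading the next pod from podKillingCh.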

4.6. statusManager

Syncs pod statuses with the apiserver; also serves as a status cache.

// Start component sync loops.
kl.statusManager.Start()

statusManager.Start is implemented as follows:

func (m *manager) Start() {
	// Don't start the status manager if we don't have a client. This will happen
	// on the master, where the kubelet is responsible for bootstrapping the pods
	// of the master components.
	if m.kubeClient == nil {
		glog.Infof("Kubernetes client is nil, not starting status manager.")
		return
	}

	glog.Info("Starting to sync pod status with apiserver")
	syncTicker := time.Tick(syncPeriod)
	// syncPod and syncBatch share the same go routine to avoid sync races.
	go wait.Forever(func() {
		select {
		case syncRequest := <-m.podStatusChannel:
			glog.V(5).Infof("Status Manager: syncing pod: %q, with status: (%d, %v) from podStatusChannel",
				syncRequest.podUID, syncRequest.status.version, syncRequest.status.status)
			m.syncPod(syncRequest.podUID, syncRequest.status)
		case <-syncTicker:
			m.syncBatch()
		}
	}, 0)
}

4.7. probeManager

Handles container probes.

kl.probeManager.Start()

4.8. runtimeClassManager

// Start syncing RuntimeClasses if enabled.
if kl.runtimeClassManager != nil {
	go kl.runtimeClassManager.Run(wait.NeverStop)
}

4.9. PodLifecycleEventGenerator

// Start the pod lifecycle event generator.
kl.pleg.Start()

PodLifecycleEventGenerator is the interface of the pod lifecycle event generator:

// PodLifecycleEventGenerator contains functions for generating pod life cycle events.
type PodLifecycleEventGenerator interface {
	Start()
	Watch() chan *PodLifecycleEvent
	Healthy() (bool, error)
}

The Start method is implemented as follows:

// Start spawns a goroutine to relist periodically.
func (g *GenericPLEG) Start() {
	go wait.Until(g.relist, g.relistPeriod, wait.NeverStop)
}

4.10. syncLoop

Finally, syncLoop is called to run the loop that syncs changes.

kl.syncLoop(updates, kl)

5. syncLoop

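syncLoop is the main loop for processing changes. It watches for changes from three channels (file, apiserver, and http). For every new change it sees, it runs a sync of the desired state against the running state. If no configuration change is seen, it syncs the last known desired state every sync-frequency seconds.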

// syncLoop is the main loop for processing changes. It watches for changes from
// three channels (file, apiserver, and http) and creates a union of them. For
// any new change seen, will run a sync against desired state and running state. If
// no changes are seen to the configuration, will synchronize the last known desired
// state every sync-frequency seconds. Never returns.
func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
	glog.Info("Starting kubelet main sync loop.")
	// The resyncTicker wakes up kubelet to checks if there are any pod workers
	// that need to be sync'd. A one-second period is sufficient because the
	// sync interval is defaulted to 10s.
	syncTicker := time.NewTicker(time.Second)
	defer syncTicker.Stop()
	housekeepingTicker := time.NewTicker(housekeepingPeriod)
	defer housekeepingTicker.Stop()
	plegCh := kl.pleg.Watch()
	const (
		base   = 100 * time.Millisecond
		max    = 5 * time.Second
		factor = 2
	)
	duration := base
	for {
		if rs := kl.runtimeState.runtimeErrors(); len(rs) != 0 {
			glog.Infof("skipping pod synchronization - %v", rs)
			// exponential backoff
			time.Sleep(duration)
			duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
			continue
		}
		// reset backoff if we have a success
		duration = base

		kl.syncLoopMonitor.Store(kl.clock.Now())
		if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
			break
		}
		kl.syncLoopMonitor.Store(kl.clock.Now())
	}
}

syncLoop calls syncLoopIteration, which runs the more specific loop that watches for pod changes. The logic of syncLoopIteration is analyzed separately later.

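6. Summary

6.1. Basic flow

The main flow of Kubelet.Run is:

  1. Initialize the modules, which amounts to starting imageManager, serverCertificateManager, oomWatcher, and resourceAnalyzer.
  2. Run the managers, most of them as long-lived goroutines, including volumeManager, statusManager, and others.
  3. Run syncLoop, the loop that processes changes and manages the pod lifecycle.

syncLoop:

syncLoop manages the pod lifecycle. It calls syncLoopIteration, which, based on the information in each PodUpdate and the type of operation, has the SyncHandler carry out lifecycle management such as adding, updating, and removing pods. The SyncHandler methods include HandlePodSyncs, HandlePodCleanups, and others. This logic is analyzed in detail in a later article.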

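6.2. Manager

The managers involved when the kubelet runs are listed below.

manager                    description
imageManager               handles image garbage collection
serverCertificateManager   handles certificates
oomWatcher                 watches memory usage for out-of-memory (OOM) events
resourceAnalyzer           monitors resource usage
volumeManager              attaches/detaches/mounts/unmounts volumes for pods
statusManager              syncs pod statuses with the apiserver; also serves as a status cache
probeManager               handles container probes
runtimeClassManager        syncs RuntimeClasses
podKiller                  kills pods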

11.5.4 - kubelet source code analysis (4): syncLoopIteration

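The following analysis is based on kubernetes v1.12.0.

This article analyzes the syncLoopIteration part of the kubelet. syncLoopIteration listens on several channels for different kinds of events and handles pod additions, updates, removals, and syncs accordingly.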

1. syncLoop

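syncLoop is the main loop for processing changes. It watches for changes from three channels (file, apiserver, and http). For every new change it sees, it runs a sync of the desired state against the running state. If no configuration change is seen, it syncs the last known desired state every sync-frequency seconds.

This code is located in pkg/kubelet/kubelet.go.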

// syncLoop is the main loop for processing changes. It watches for changes from
// three channels (file, apiserver, and http) and creates a union of them. For
// any new change seen, will run a sync against desired state and running state. If
// no changes are seen to the configuration, will synchronize the last known desired
// state every sync-frequency seconds. Never returns.
func (kl *Kubelet) syncLoop(updates <-chan kubetypes.PodUpdate, handler SyncHandler) {
	glog.Info("Starting kubelet main sync loop.")
	// The resyncTicker wakes up kubelet to checks if there are any pod workers
	// that need to be sync'd. A one-second period is sufficient because the
	// sync interval is defaulted to 10s.
	syncTicker := time.NewTicker(time.Second)
	defer syncTicker.Stop()
	housekeepingTicker := time.NewTicker(housekeepingPeriod)
	defer housekeepingTicker.Stop()
	plegCh := kl.pleg.Watch()
	const (
		base   = 100 * time.Millisecond
		max    = 5 * time.Second
		factor = 2
	)
	duration := base
	for {
		if rs := kl.runtimeState.runtimeErrors(); len(rs) != 0 {
			glog.Infof("skipping pod synchronization - %v", rs)
			// exponential backoff
			time.Sleep(duration)
			duration = time.Duration(math.Min(float64(max), factor*float64(duration)))
			continue
		}
		// reset backoff if we have a success
		duration = base

		kl.syncLoopMonitor.Store(kl.clock.Now())
		if !kl.syncLoopIteration(updates, handler, syncTicker.C, housekeepingTicker.C, plegCh) {
			break
		}
		kl.syncLoopMonitor.Store(kl.clock.Now())
	}
}

syncLoop calls syncLoopIteration, which runs the more specific loop that watches for pod changes.

2. syncLoopIteration

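syncLoopIteration listens for and handles different kinds of events through several channels: configCh, plegCh, syncCh, houseKeepingCh, and livenessManager.Updates().

syncLoopIteration is where pod operations are actually triggered. It reads from the following channels:

  • configCh: dispatches pods with configuration changes to the handler callback for the event type.
  • plegCh: updates the runtime cache and syncs the pod.
  • syncCh: syncs all pods that are waiting for a sync.
  • houseKeepingCh: triggers pod cleanup.
  • livenessManager.Updates(): syncs failed pods, or pods whose liveness check has failed.

The syncLoopIteration code is located in pkg/kubelet/kubelet.go.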

2.1. configCh

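configCh dispatches pods with configuration changes to the handler callback for the event type. This branch mainly uses SyncHandler to add, update, remove, or reconcile pods depending on the event.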

func (kl *Kubelet) syncLoopIteration(configCh <-chan kubetypes.PodUpdate, handler SyncHandler,
	syncCh <-chan time.Time, housekeepingCh <-chan time.Time, plegCh <-chan *pleg.PodLifecycleEvent) bool {
	select {
	case u, open := <-configCh:
		// Update from a config source; dispatch it to the right handler
		// callback.
		if !open {
			glog.Errorf("Update channel is closed. Exiting the sync loop.")
			return false
		}

		switch u.Op {
		case kubetypes.ADD:
			glog.V(2).Infof("SyncLoop (ADD, %q): %q", u.Source, format.Pods(u.Pods))
			// After restarting, kubelet will get all existing pods through
			// ADD as if they are new pods. These pods will then go through the
			// admission process and *may* be rejected. This can be resolved
			// once we have checkpointing.
			handler.HandlePodAdditions(u.Pods)
		case kubetypes.UPDATE:
			glog.V(2).Infof("SyncLoop (UPDATE, %q): %q", u.Source, format.PodsWithDeletionTimestamps(u.Pods))
			handler.HandlePodUpdates(u.Pods)
		case kubetypes.REMOVE:
			glog.V(2).Infof("SyncLoop (REMOVE, %q): %q", u.Source, format.Pods(u.Pods))
			handler.HandlePodRemoves(u.Pods)
		case kubetypes.RECONCILE:
			glog.V(4).Infof("SyncLoop (RECONCILE, %q): %q", u.Source, format.Pods(u.Pods))
			handler.HandlePodReconcile(u.Pods)
		case kubetypes.DELETE:
			glog.V(2).Infof("SyncLoop (DELETE, %q): %q", u.Source, format.Pods(u.Pods))
			// DELETE is treated as a UPDATE because of graceful deletion.
			handler.HandlePodUpdates(u.Pods)
		case kubetypes.RESTORE:
			glog.V(2).Infof("SyncLoop (RESTORE, %q): %q", u.Source, format.Pods(u.Pods))
			// These are pods restored from the checkpoint. Treat them as new
			// pods.
			handler.HandlePodAdditions(u.Pods)
		case kubetypes.SET:
			// TODO: Do we want to support this?
			glog.Errorf("Kubelet does not support snapshot update")
		}
		...
}

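As shown above, syncLoopIteration performs a different pod operation depending on the Op of the PodUpdate:

  • ADD: HandlePodAdditions
  • UPDATE: HandlePodUpdates
  • REMOVE: HandlePodRemoves
  • RECONCILE: HandlePodReconcile
  • DELETE: HandlePodUpdates
  • RESTORE: HandlePodAdditions
  • podsToSync (not an Op; these pods come from syncCh): HandlePodSyncs

The handler operations are carried out by SyncHandler, an interface implemented by the kubelet itself; see the analysis below.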
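
The mapping from operation to handler can be condensed into a small lookup, which makes it easy to see that DELETE shares a handler with UPDATE and RESTORE shares one with ADD. The op type, its constants, and handlerFor are hypothetical stand-ins for kubetypes.PodOperation and the switch in syncLoopIteration; the sketch returns the handler name instead of calling it.

```go
package main

import "fmt"

// op is a hypothetical stand-in for kubetypes.PodOperation.
type op string

const (
	opAdd       op = "ADD"
	opUpdate    op = "UPDATE"
	opRemove    op = "REMOVE"
	opReconcile op = "RECONCILE"
	opDelete    op = "DELETE"
	opRestore   op = "RESTORE"
	opSet       op = "SET"
)

// handlerFor returns the name of the SyncHandler method that the configCh
// branch calls for the given operation, or "" if the operation is not
// dispatched (SET is only logged as unsupported).
func handlerFor(o op) string {
	switch o {
	case opAdd, opRestore:
		return "HandlePodAdditions"
	case opUpdate, opDelete:
		// DELETE is treated as an UPDATE because of graceful deletion.
		return "HandlePodUpdates"
	case opRemove:
		return "HandlePodRemoves"
	case opReconcile:
		return "HandlePodReconcile"
	default:
		return ""
	}
}

func main() {
	for _, o := range []op{opAdd, opUpdate, opRemove, opReconcile, opDelete, opRestore, opSet} {
		fmt.Printf("%s -> %q\n", o, handlerFor(o))
	}
}
```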

2.2. plegCh

plegCh: updates the runtime cache and syncs the pod. This branch calls HandlePodSyncs.

case e := <-plegCh:
	if isSyncPodWorthy(e) {
		// PLEG event for a pod; sync it.
		if pod, ok := kl.podManager.GetPodByUID(e.ID); ok {
			glog.V(2).Infof("SyncLoop (PLEG): %q, event: %#v", format.Pod(pod), e)
			handler.HandlePodSyncs([]*v1.Pod{pod})
		} else {
			// If the pod no longer exists, ignore the event.
			glog.V(4).Infof("SyncLoop (PLEG): ignore irrelevant event: %#v", e)
		}
	}

	if e.Type == pleg.ContainerDied {
		if containerID, ok := e.Data.(string); ok {
			kl.cleanUpContainersInPod(e.ID, containerID)
		}
	}

2.3. syncCh

syncCh: syncs all pods that are waiting for a sync. This branch calls HandlePodSyncs.

case <-syncCh:
	// Sync pods waiting for sync
	podsToSync := kl.getPodsToSync()
	if len(podsToSync) == 0 {
		break
	}
	glog.V(4).Infof("SyncLoop (SYNC): %d pods; %s", len(podsToSync), format.Pods(podsToSync))
	handler.HandlePodSyncs(podsToSync)

2.4. livenessManager.Updates

livenessManager.Updates(): syncs failed pods, or pods whose liveness check has failed. This branch calls HandlePodSyncs.

case update := <-kl.livenessManager.Updates():
	if update.Result == proberesults.Failure {
		// The liveness manager detected a failure; sync the pod.

		// We should not use the pod from livenessManager, because it is never updated after
		// initialization.
		pod, ok := kl.podManager.GetPodByUID(update.PodUID)
		if !ok {
			// If the pod no longer exists, ignore the update.
			glog.V(4).Infof("SyncLoop (container unhealthy): ignore irrelevant update: %#v", update)
			break
		}
		glog.V(1).Infof("SyncLoop (container unhealthy): %q", format.Pod(pod))
		handler.HandlePodSyncs([]*v1.Pod{pod})
	}

2.5. housekeepingCh

houseKeepingCh: triggers pod cleanup. This branch calls HandlePodCleanups.

case <-housekeepingCh:
	if !kl.sourcesReady.AllReady() {
		// If the sources aren't ready or volume manager has not yet synced the states,
		// skip housekeeping, as we may accidentally delete pods from unready sources.
		glog.V(4).Infof("SyncLoop (housekeeping, skipped): sources aren't ready yet.")
	} else {
		glog.V(4).Infof("SyncLoop (housekeeping)")
		if err := handler.HandlePodCleanups(); err != nil {
			glog.Errorf("Failed cleaning pods: %v", err)
		}
	}

3. SyncHandler

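SyncHandler is an interface that defines the handlers for the different pod events. It is implemented by the kubelet, and its methods are mainly called in syncLoopIteration. The interface is defined as follows: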

// SyncHandler is an interface implemented by Kubelet, for testability
type SyncHandler interface {
	HandlePodAdditions(pods []*v1.Pod)
	HandlePodUpdates(pods []*v1.Pod)
	HandlePodRemoves(pods []*v1.Pod)
	HandlePodReconcile(pods []*v1.Pod)
	HandlePodSyncs(pods []*v1.Pod)
	HandlePodCleanups() error
}

The SyncHandler code is located in pkg/kubelet/kubelet.go.

3.1. HandlePodAdditions

HandlePodAdditions first sorts the pods by creation time, then iterates over the list and processes each pod.

// HandlePodAdditions is the callback in SyncHandler for pods being added from
// a config source.
func (kl *Kubelet) HandlePodAdditions(pods []*v1.Pod) {
	start := kl.clock.Now()
	sort.Sort(sliceutils.PodsByCreationTime(pods))
	for _, pod := range pods {
		...
	}
}

Add the pod to the pod manager.

for _, pod := range pods {
	// Responsible for checking limits in resolv.conf
	if kl.dnsConfigurer != nil && kl.dnsConfigurer.ResolverConfig != "" {
		kl.dnsConfigurer.CheckLimitsForResolvConf()
	}
	existingPods := kl.podManager.GetPods()
	// Always add the pod to the pod manager. Kubelet relies on the pod
	// manager as the source of truth for the desired state. If a pod does
	// not exist in the pod manager, it means that it has been deleted in
	// the apiserver and no action (other than cleanup) is required.
	kl.podManager.AddPod(pod)
	...
}

If the pod is a mirror pod, handle it as a mirror pod.

if kubepod.IsMirrorPod(pod) {
	kl.handleMirrorPod(pod, start)
	continue
}

If the pod is not in the Terminated state, check whether it can be admitted; if not, reject it, which sets the pod's status to Failed.

if !kl.podIsTerminated(pod) {
	// Only go through the admission process if the pod is not
	// terminated.

	// We failed pods that we rejected, so activePods include all admitted
	// pods that are alive.
	activePods := kl.filterOutTerminatedPods(existingPods)

	// Check if we can admit the pod; if not, reject it.
	if ok, reason, message := kl.canAdmitPod(activePods, pod); !ok {
		kl.rejectPod(pod, reason, message)
		continue
	}
}

Call dispatchWork, the core function used by the SyncHandler methods. It starts the asynchronous sync of the pod in a pod worker. Its details are analyzed below.

mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
kl.dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)

Finally, add the pod to the probe manager.

kl.probeManager.AddPod(pod)

3.2. HandlePodUpdates

HandlePodUpdates likewise iterates over the pod list and processes each pod.

// HandlePodUpdates is the callback in the SyncHandler interface for pods
// being updated from a config source.
func (kl *Kubelet) HandlePodUpdates(pods []*v1.Pod) {
	start := kl.clock.Now()
	for _, pod := range pods {
	...
	}
}

Update the pod in the pod manager.

for _, pod := range pods {
	// Responsible for checking limits in resolv.conf
	if kl.dnsConfigurer != nil && kl.dnsConfigurer.ResolverConfig != "" {
		kl.dnsConfigurer.CheckLimitsForResolvConf()
	}
	kl.podManager.UpdatePod(pod)
	...
}

If the pod is a mirror pod, handle it as a mirror pod.

if kubepod.IsMirrorPod(pod) {
	kl.handleMirrorPod(pod, start)
	continue
}

Call dispatchWork.

// TODO: Evaluate if we need to validate and reject updates.

mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
kl.dispatchWork(pod, kubetypes.SyncPodUpdate, mirrorPod, start)

3.3. HandlePodRemoves

HandlePodRemoves iterates over the pod list.

// HandlePodRemoves is the callback in the SyncHandler interface for pods
// being removed from a config source.
func (kl *Kubelet) HandlePodRemoves(pods []*v1.Pod) {
	start := kl.clock.Now()
	for _, pod := range pods {
		...
	}
}

Delete the pod from the pod manager.

for _, pod := range pods {
	kl.podManager.DeletePod(pod)
	...
}

If the pod is a mirror pod, handle it as a mirror pod.

if kubepod.IsMirrorPod(pod) {
	kl.handleMirrorPod(pod, start)
	continue
}

Call the kubelet's deletePod function to delete the pod.

// Deletion is allowed to fail because the periodic cleanup routine
// will trigger deletion again.
if err := kl.deletePod(pod); err != nil {
	glog.V(2).Infof("Failed to delete pod %q, err: %v", format.Pod(pod), err)
}

deletePod sends the pod to be deleted on the podKillingCh channel; podKiller listens on that channel and performs the kill. The implementation is:

// deletePod deletes the pod from the internal state of the kubelet by:
// 1.  stopping the associated pod worker asynchronously
// 2.  signaling to kill the pod by sending on the podKillingCh channel
//
// deletePod returns an error if not all sources are ready or the pod is not
// found in the runtime cache.
func (kl *Kubelet) deletePod(pod *v1.Pod) error {
	if pod == nil {
		return fmt.Errorf("deletePod does not allow nil pod")
	}
	if !kl.sourcesReady.AllReady() {
		// If the sources aren't ready, skip deletion, as we may accidentally delete pods
		// for sources that haven't reported yet.
		return fmt.Errorf("skipping delete because sources aren't ready yet")
	}
	kl.podWorkers.ForgetWorker(pod.UID)

	// Runtime cache may not have been updated to with the pod, but it's okay
	// because the periodic cleanup routine will attempt to delete again later.
	runningPods, err := kl.runtimeCache.GetPods()
	if err != nil {
		return fmt.Errorf("error listing containers: %v", err)
	}
	runningPod := kubecontainer.Pods(runningPods).FindPod("", pod.UID)
	if runningPod.IsEmpty() {
		return fmt.Errorf("pod not found")
	}
	podPair := kubecontainer.PodPair{APIPod: pod, RunningPod: &runningPod}

	kl.podKillingCh <- &podPair
	// TODO: delete the mirror pod here?

	// We leave the volume/directory cleanup to the periodic cleanup routine.
	return nil
}

Remove the pod from the probe manager.

kl.probeManager.RemovePod(pod)

3.4. HandlePodReconcile

HandlePodReconcile iterates over the pod list.

// HandlePodReconcile is the callback in the SyncHandler interface for pods
// that should be reconciled.
func (kl *Kubelet) HandlePodReconcile(pods []*v1.Pod) {
	start := kl.clock.Now()
	for _, pod := range pods {
		...
	}
}

Update the pod in the pod manager.

for _, pod := range pods {
	// Update the pod in pod manager, status manager will do periodically reconcile according
	// to the pod manager.
	kl.podManager.UpdatePod(pod)
	...
}

If the pod's Ready condition needs to be reconciled, call dispatchWork to trigger a sync.

// Reconcile Pod "Ready" condition if necessary. Trigger sync pod for reconciliation.
if status.NeedToReconcilePodReadiness(pod) {
	mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
	kl.dispatchWork(pod, kubetypes.SyncPodSync, mirrorPod, start)
}

If the pod has been evicted, remove the dead containers in the pod.

// After an evicted pod is synced, all dead containers in the pod can be removed.
if eviction.PodIsEvicted(pod.Status) {
	if podStatus, err := kl.podCache.Get(pod.UID); err == nil {
		kl.containerDeletor.deleteContainersInPod("", podStatus, true)
	}
}

3.5. HandlePodSyncs

HandlePodSyncs is the SyncHandler callback that calls dispatchWork so that the pod workers perform the sync.

// HandlePodSyncs is the callback in the syncHandler interface for pods
// that should be dispatched to pod workers for sync.
func (kl *Kubelet) HandlePodSyncs(pods []*v1.Pod) {
	start := kl.clock.Now()
	for _, pod := range pods {
		mirrorPod, _ := kl.podManager.GetMirrorPodByPod(pod)
		kl.dispatchWork(pod, kubetypes.SyncPodSync, mirrorPod, start)
	}
}

3.6. HandlePodCleanups

HandlePodCleanups performs pod cleanup work, covering terminating pods, orphaned pods, and so on.

First, collect the cgroups used by pods.

// HandlePodCleanups performs a series of cleanup work, including terminating
// pod workers, killing unwanted pods, and removing orphaned volumes/pod
// directories.
// NOTE: This function is executed by the main sync loop, so it
// should not contain any blocking calls.
func (kl *Kubelet) HandlePodCleanups() error {
	// The kubelet lacks checkpointing, so we need to introspect the set of pods
	// in the cgroup tree prior to inspecting the set of pods in our pod manager.
	// this ensures our view of the cgroup tree does not mistakenly observe pods
	// that are added after the fact...
	var (
		cgroupPods map[types.UID]cm.CgroupName
		err        error
	)
	if kl.cgroupsPerQOS {
		pcm := kl.containerManager.NewPodContainerManager()
		cgroupPods, err = pcm.GetAllPodsFromCgroups()
		if err != nil {
			return fmt.Errorf("failed to get list of pods that still exist on cgroup mounts: %v", err)
		}
	}
	...
}

List all pods, including mirror pods.

allPods, mirrorPods := kl.podManager.GetPodsAndMirrorPods()
// Pod phase progresses monotonically. Once a pod has reached a final state,
// it should never leave regardless of the restart policy. The statuses
// of such pods should not be changed, and there is no need to sync them.
// TODO: the logic here does not handle two cases:
//   1. If the containers were removed immediately after they died, kubelet
//      may fail to generate correct statuses, let alone filtering correctly.
//   2. If kubelet restarted before writing the terminated status for a pod
//      to the apiserver, it could still restart the terminated pod (even
//      though the pod was not considered terminated by the apiserver).
// These two conditions could be alleviated by checkpointing kubelet.
activePods := kl.filterOutTerminatedPods(allPods)

desiredPods := make(map[types.UID]empty)
for _, pod := range activePods {
	desiredPods[pod.UID] = empty{}
}

Stop the pod workers of pods that no longer exist, and clean those pods up from the probe manager.

// Stop the workers for no-longer existing pods.
// TODO: is here the best place to forget pod workers?
kl.podWorkers.ForgetNonExistingPodWorkers(desiredPods)
kl.probeManager.CleanupPods(activePods)

Send the pods that need to be killed on the podKillingCh channel. podKiller listens on this channel, receives the pods to kill, and kills them.

runningPods, err := kl.runtimeCache.GetPods()
if err != nil {
	glog.Errorf("Error listing containers: %#v", err)
	return err
}
for _, pod := range runningPods {
	if _, found := desiredPods[pod.ID]; !found {
		kl.podKillingCh <- &kubecontainer.PodPair{APIPod: nil, RunningPod: pod}
	}
}

When a pod is no longer bound to this node, remove its status. removeOrphanedPodStatuses ultimately calls the RemoveOrphanedStatuses method of statusManager.

kl.removeOrphanedPodStatuses(allPods, mirrorPods)

Remove all orphaned volumes.

// Remove any orphaned volumes.
// Note that we pass all pods (including terminated pods) to the function,
// so that we don't remove volumes associated with terminated but not yet
// deleted pods.
err = kl.cleanupOrphanedPodDirs(allPods, runningPods)
if err != nil {
	// We want all cleanup tasks to be run even if one of them failed. So
	// we just log an error here and continue other cleanup tasks.
	// This also applies to the other clean up tasks.
	glog.Errorf("Failed cleaning up orphaned pod directories: %v", err)
}

Remove orphaned mirror pods.

// Remove any orphaned mirror pods.
kl.podManager.DeleteOrphanedMirrorPods()

Remove the cgroups of pods that are no longer running.

// Remove any cgroups in the hierarchy for pods that are no longer running.
if kl.cgroupsPerQOS {
	kl.cleanupOrphanedPodCgroups(cgroupPods, activePods)
}

Run garbage collection (GC) on the backoff state.

kl.backOff.GC()

4. dispatchWork

dispatchWork starts an asynchronous sync of the pod in a pod worker.

The complete code is as follows:

// dispatchWork starts the asynchronous sync of the pod in a pod worker.
// If the pod is terminated, dispatchWork
func (kl *Kubelet) dispatchWork(pod *v1.Pod, syncType kubetypes.SyncPodType, mirrorPod *v1.Pod, start time.Time) {
	if kl.podIsTerminated(pod) {
		if pod.DeletionTimestamp != nil {
			// If the pod is in a terminated state, there is no pod worker to
			// handle the work item. Check if the DeletionTimestamp has been
			// set, and force a status update to trigger a pod deletion request
			// to the apiserver.
			kl.statusManager.TerminatePod(pod)
		}
		return
	}
	// Run the sync in an async worker.
	kl.podWorkers.UpdatePod(&UpdatePodOptions{
		Pod:        pod,
		MirrorPod:  mirrorPod,
		UpdateType: syncType,
		OnCompleteFunc: func(err error) {
			if err != nil {
				metrics.PodWorkerLatency.WithLabelValues(syncType.String()).Observe(metrics.SinceInMicroseconds(start))
			}
		},
	})
	// Note the number of containers for new pods.
	if syncType == kubetypes.SyncPodCreate {
		metrics.ContainersPerPodCount.Observe(float64(len(pod.Spec.Containers)))
	}
}

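The code is analyzed piece by piece below:

If the pod is in the Terminated state, dispatchWork does not hand it to a pod worker; if its DeletionTimestamp is set, it calls the TerminatePod method of statusManager.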

// dispatchWork starts the asynchronous sync of the pod in a pod worker.
// If the pod is terminated, dispatchWork
func (kl *Kubelet) dispatchWork(pod *v1.Pod, syncType kubetypes.SyncPodType, mirrorPod *v1.Pod, start time.Time) {
	if kl.podIsTerminated(pod) {
		if pod.DeletionTimestamp != nil {
			// If the pod is in a terminated state, there is no pod worker to
			// handle the work item. Check if the DeletionTimestamp has been
			// set, and force a status update to trigger a pod deletion request
			// to the apiserver.
			kl.statusManager.TerminatePod(pod)
		}
		return
	}
	...
}

Call the pod workers' UpdatePod function, the core function of the pod workers, which performs the pod operations. Its logic is analyzed below.

// Run the sync in an async worker.
kl.podWorkers.UpdatePod(&UpdatePodOptions{
	Pod:        pod,
	MirrorPod:  mirrorPod,
	UpdateType: syncType,
	OnCompleteFunc: func(err error) {
		if err != nil {
			metrics.PodWorkerLatency.WithLabelValues(syncType.String()).Observe(metrics.SinceInMicroseconds(start))
		}
	},
})

When the sync type is SyncPodCreate (that is, a pod is being created), record the number of containers in the new pod.

// Note the number of containers for new pods.
if syncType == kubetypes.SyncPodCreate {
	metrics.ContainersPerPodCount.Observe(float64(len(pod.Spec.Containers)))
}

5. PodWorkers.UpdatePod

PodWorkers is an interface type:

// PodWorkers is an abstract interface for testability.
type PodWorkers interface {
	UpdatePod(options *UpdatePodOptions)
	ForgetNonExistingPodWorkers(desiredPods map[types.UID]empty)
	ForgetWorker(uid types.UID)
}

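UpdatePod is the core method. It passes the pod information to be processed through the podUpdates channel. For each pod seen for the first time, a dedicated goroutine is started to run managePodLoop.

This code is located in pkg/kubelet/pod_workers.go.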

// Apply the new setting to the specified pod.
// If the options provide an OnCompleteFunc, the function is invoked if the update is accepted.
// Update requests are ignored if a kill pod request is pending.
func (p *podWorkers) UpdatePod(options *UpdatePodOptions) {
	pod := options.Pod
	uid := pod.UID
	var podUpdates chan UpdatePodOptions
	var exists bool

	p.podLock.Lock()
	defer p.podLock.Unlock()
	if podUpdates, exists = p.podUpdates[uid]; !exists {
		// We need to have a buffer here, because checkForUpdates() method that
		// puts an update into channel is called from the same goroutine where
		// the channel is consumed. However, it is guaranteed that in such case
		// the channel is empty, so buffer of size 1 is enough.
		podUpdates = make(chan UpdatePodOptions, 1)
		p.podUpdates[uid] = podUpdates

		// Creating a new pod worker either means this is a new pod, or that the
		// kubelet just restarted. In either case the kubelet is willing to believe
		// the status of the pod for the first pod worker sync. See corresponding
		// comment in syncPod.
		go func() {
			defer runtime.HandleCrash()
			p.managePodLoop(podUpdates)
		}()
	}
	if !p.isWorking[pod.UID] {
		p.isWorking[pod.UID] = true
		podUpdates <- *options
	} else {
		// if a request to kill a pod is pending, we do not let anything overwrite that request.
		update, found := p.lastUndeliveredWorkUpdate[pod.UID]
		if !found || update.UpdateType != kubetypes.SyncPodKill {
			p.lastUndeliveredWorkUpdate[pod.UID] = *options
		}
	}
}
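
The dispatch pattern in UpdatePod can be reduced to a small standalone sketch. The types and names below (update, workers, handle) are hypothetical stand-ins rather than kubelet code, and the body of wrapUp is an assumption about what happens after each sync, since the excerpt only calls p.wrapUp without showing it.

```go
package main

import (
	"fmt"
	"sync"
)

// update is a hypothetical stand-in for UpdatePodOptions.
type update struct {
	uid  string
	kind string // "create", "update" or "kill"
}

// workers is a simplified, illustrative version of podWorkers.
type workers struct {
	mu              sync.Mutex
	updates         map[string]chan update
	isWorking       map[string]bool
	lastUndelivered map[string]update
	handle          func(update) // plays the role of syncPodFn
}

func newWorkers(handle func(update)) *workers {
	return &workers{
		updates:         map[string]chan update{},
		isWorking:       map[string]bool{},
		lastUndelivered: map[string]update{},
		handle:          handle,
	}
}

// Update follows the same shape as UpdatePod: one goroutine and one
// buffered channel per uid, plus a single pending slot while busy.
func (w *workers) Update(u update) {
	w.mu.Lock()
	defer w.mu.Unlock()
	ch, exists := w.updates[u.uid]
	if !exists {
		ch = make(chan update, 1)
		w.updates[u.uid] = ch
		go w.loop(u.uid, ch)
	}
	if !w.isWorking[u.uid] {
		w.isWorking[u.uid] = true
		ch <- u
		return
	}
	// A pending kill is never overwritten by a later update.
	if last, found := w.lastUndelivered[u.uid]; !found || last.kind != "kill" {
		w.lastUndelivered[u.uid] = u
	}
}

func (w *workers) loop(uid string, ch chan update) {
	for u := range ch {
		w.handle(u)
		w.wrapUp(uid, ch)
	}
}

// wrapUp (assumed behavior) delivers the pending update, if any,
// or marks the worker idle.
func (w *workers) wrapUp(uid string, ch chan update) {
	w.mu.Lock()
	defer w.mu.Unlock()
	if next, ok := w.lastUndelivered[uid]; ok {
		delete(w.lastUndelivered, uid)
		ch <- next // the channel is empty here, so the size-1 buffer suffices
		return
	}
	w.isWorking[uid] = false
}

// demo queues three updates while the first one is still being handled.
func demo() []string {
	var seen []string
	started, release, done := make(chan struct{}), make(chan struct{}), make(chan struct{})
	w := newWorkers(func(u update) {
		if u.kind == "create" {
			close(started)
			<-release
		}
		seen = append(seen, u.kind)
		if u.kind != "create" {
			close(done)
		}
	})
	w.Update(update{uid: "pod-1", kind: "create"})
	<-started
	w.Update(update{uid: "pod-1", kind: "update"}) // stored as pending
	w.Update(update{uid: "pod-1", kind: "kill"})   // replaces the pending update
	w.Update(update{uid: "pod-1", kind: "update"}) // dropped: a kill is pending
	close(release)
	<-done
	return seen
}

func main() {
	fmt.Println(demo())
}
```

In this sketch only the create and the kill reach the handler: the intermediate update is replaced by the kill, and the later update is dropped because a kill is already pending.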

6. managePodLoop

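managePodLoop reads updates from the podUpdates channel and calls the syncPodFn function for each one. syncPodFn is assigned in newPodWorkers and is in fact kubelet.syncPod. The logic of kubelet.syncPod is analyzed separately in a later article.

// newPodWorkers receives the syncPod function
klet.podWorkers = newPodWorkers(klet.syncPod, kubeDeps.Recorder, klet.workQueue, klet.resyncInterval, backOffPeriod, klet.podCache)

The newPodWorkers function, for reference: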

func newPodWorkers(syncPodFn syncPodFnType, recorder record.EventRecorder, workQueue queue.WorkQueue,
	resyncInterval, backOffPeriod time.Duration, podCache kubecontainer.Cache) *podWorkers {
	return &podWorkers{
		podUpdates:                map[types.UID]chan UpdatePodOptions{},
		isWorking:                 map[types.UID]bool{},
		lastUndeliveredWorkUpdate: map[types.UID]UpdatePodOptions{},
		syncPodFn:                 syncPodFn, // klet.syncPod is passed in at construction
		recorder:                  recorder,
		workQueue:                 workQueue,
		resyncInterval:            resyncInterval,
		backOffPeriod:             backOffPeriod,
		podCache:                  podCache,
	}
}

The managePodLoop function, for reference:

This code is located in pkg/kubelet/pod_workers.go.

func (p *podWorkers) managePodLoop(podUpdates <-chan UpdatePodOptions) {
	var lastSyncTime time.Time
	for update := range podUpdates {
		err := func() error {
			podUID := update.Pod.UID
			// This is a blocking call that would return only if the cache
			// has an entry for the pod that is newer than minRuntimeCache
			// Time. This ensures the worker doesn't start syncing until
			// after the cache is at least newer than the finished time of
			// the previous sync.
			status, err := p.podCache.GetNewerThan(podUID, lastSyncTime)
			if err != nil {
				// This is the legacy event thrown by manage pod loop
				// all other events are now dispatched from syncPodFn
				p.recorder.Eventf(update.Pod, v1.EventTypeWarning, events.FailedSync, "error determining status: %v", err)
				return err
			}
			err = p.syncPodFn(syncPodOptions{
				mirrorPod:      update.MirrorPod,
				pod:            update.Pod,
				podStatus:      status,
				killPodOptions: update.KillPodOptions,
				updateType:     update.UpdateType,
			})
			lastSyncTime = time.Now()
			return err
		}()
		// notify the call-back function if the operation succeeded or not
		if update.OnCompleteFunc != nil {
			update.OnCompleteFunc(err)
		}
		if err != nil {
			// IMPORTANT: we do not log errors here, the syncPodFn is responsible for logging errors
			glog.Errorf("Error syncing pod %s (%q), skipping: %v", update.Pod.UID, format.Pod(update.Pod), err)
		}
		p.wrapUp(update.Pod.UID, err)
	}
}

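7. Summary

The basic flow of syncLoopIteration is as follows:

  1. Several channels are used to listen for and handle different types of events: configCh, plegCh, syncCh, houseKeepingCh, and livenessManager.Updates().
  2. Different SyncHandler methods perform the corresponding add, update, remove, and other operations.
  3. HandlePodAdditions, HandlePodUpdates, HandlePodReconcile, and HandlePodSyncs all call dispatchWork to perform the pod operations. HandlePodCleanups performs pod cleanup by sending the pods to be cleaned up through a channel to podKiller.
  4. dispatchWork calls podWorkers.UpdatePod to perform the work asynchronously.
  5. podWorkers.UpdatePod starts managePodLoop in a goroutine, which runs the loop of pod operations.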

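Channel types and their roles:

  • configCh: dispatches pods with configuration changes to the handler callback for the event type.
  • plegCh: updates the runtime cache and syncs the pod.
  • syncCh: syncs all pods that are waiting to be synced.
  • houseKeepingCh: triggers pod cleanup.
  • livenessManager.Updates(): syncs pods that have failed or whose liveness checks have failed.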
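
As an illustration of how one loop iteration can wait on several event sources at once, here is a minimal sketch built on a Go select statement. The function name iterate, the string payloads, and the returned labels are hypothetical. The channel names are borrowed from the list above, and treating a closed config channel as the signal to stop is an assumption of this sketch.

```go
package main

import "fmt"

// iterate handles exactly one event from whichever channel is ready.
// It returns a label for the event and false when the loop should stop.
func iterate(configCh, plegCh <-chan string, syncCh, housekeepingCh <-chan struct{}) (string, bool) {
	select {
	case u, open := <-configCh:
		if !open {
			// Assumption: a closed config channel ends the loop.
			return "config channel closed", false
		}
		return "config: " + u, true
	case e := <-plegCh:
		return "pleg: " + e, true
	case <-syncCh:
		return "sync", true
	case <-housekeepingCh:
		return "housekeeping", true
	}
}

func main() {
	configCh := make(chan string, 1)
	syncCh := make(chan struct{}, 1)

	configCh <- "add pod-1"
	fmt.Println(iterate(configCh, nil, nil, nil))

	syncCh <- struct{}{}
	fmt.Println(iterate(nil, nil, syncCh, nil))

	close(configCh)
	fmt.Println(iterate(configCh, nil, nil, nil))
}
```

A nil channel is never ready in a select, which is why the demo can pass nil for the sources it does not exercise.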

11.5.5 - kubelet source code analysis (5): syncPod

The following analysis is based on Kubernetes v1.12.0.

This article analyzes the syncPod part of the kubelet.

1. managePodLoop

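managePodLoop reads updates from the podUpdates channel and calls the syncPodFn function for each one. syncPodFn is assigned in newPodWorkers and is in fact kubelet.syncPod.

The complete code of managePodLoop is as follows:

This code is located in pkg/kubelet/pod_workers.go.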

func (p *podWorkers) managePodLoop(podUpdates <-chan UpdatePodOptions) {
	var lastSyncTime time.Time
	for update := range podUpdates {
		err := func() error {
			podUID := update.Pod.UID
			// This is a blocking call that would return only if the cache
			// has an entry for the pod that is newer than minRuntimeCache
			// Time. This ensures the worker doesn't start syncing until
			// after the cache is at least newer than the finished time of
			// the previous sync.
			status, err := p.podCache.GetNewerThan(podUID, lastSyncTime)
			if err != nil {
				// This is the legacy event thrown by manage pod loop
				// all other events are now dispatched from syncPodFn
				p.recorder.Eventf(update.Pod, v1.EventTypeWarning, events.FailedSync, "error determining status: %v", err)
				return err
			}
			// syncPodFn here is actually implemented by kubelet.syncPod
			err = p.syncPodFn(syncPodOptions{
				mirrorPod:      update.MirrorPod,
				pod:            update.Pod,
				podStatus:      status,
				killPodOptions: update.KillPodOptions,
				updateType:     update.UpdateType,
			})
			lastSyncTime = time.Now()
			return err
		}()
		// notify the call-back function if the operation succeeded or not
		if update.OnCompleteFunc != nil {
			update.OnCompleteFunc(err)
		}
		if err != nil {
			// IMPORTANT: we do not log errors here, the syncPodFn is responsible for logging errors
			glog.Errorf("Error syncing pod %s (%q), skipping: %v", update.Pod.UID, format.Pod(update.Pod), err)
		}
		p.wrapUp(update.Pod.UID, err)
	}
}

The syncPod logic is analyzed below.

2. syncPod

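syncPod can be thought of as a transaction script that syncs a single pod. Its argument is a syncPodOptions, which records the information about the pod to be synced. It is defined as follows: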

// syncPodOptions provides the arguments to a SyncPod operation.
type syncPodOptions struct {
	// the mirror pod for the pod to sync, if it is a static pod
	mirrorPod *v1.Pod
	// pod to sync
	pod *v1.Pod
	// the type of update (create, update, sync)
	updateType kubetypes.SyncPodType
	// the current status
	podStatus *kubecontainer.PodStatus
	// if update type is kill, use the specified options to kill the pod.
	killPodOptions *KillPodOptions
}

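syncPod mainly runs the following workflow:

  • If the pod is being created, record the pod worker start latency.
  • Call generateAPIPodStatus to prepare a v1.PodStatus for the pod.
  • If the pod is seen as running for the first time, record the pod start latency.
  • Update the pod status in the status manager.
  • Kill the pod if it should not be running.
  • If the pod is a static pod and has no mirror pod, create a mirror pod.
  • Create the pod's data directories if they do not exist.
  • Wait for volumes to be attached and mounted.
  • Fetch the pod's image pull secrets.
  • Call the container runtime's SyncPod function to perform the pod operations.
  • Update the pod's ingress and egress traffic limits.

If any step in this workflow returns an error, syncPod returns that error, and the workflow runs again on the next syncPod call. Errors are also recorded as events to make debugging easier.

The execution of syncPod is analyzed below.

The syncPod code is located in pkg/kubelet/kubelet.go.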

2.1. SyncPodKill

First, the pod information is pulled out of syncPodOptions.

func (kl *Kubelet) syncPod(o syncPodOptions) error {
	// pull out the required options
	pod := o.pod
	mirrorPod := o.mirrorPod
	podStatus := o.podStatus
	updateType := o.updateType
	...
}

If the pod is to be killed, killPod is called, which kills the pod within the specified grace period.

// if we want to kill a pod, do it now!
if updateType == kubetypes.SyncPodKill {
	killPodOptions := o.killPodOptions
	if killPodOptions == nil || killPodOptions.PodStatusFunc == nil {
		return fmt.Errorf("kill pod options are required if update type is kill")
	}
	apiPodStatus := killPodOptions.PodStatusFunc(pod, podStatus)
	kl.statusManager.SetPodStatus(pod, apiPodStatus)
	// we kill the pod with the specified grace period since this is a termination
	if err := kl.killPod(pod, nil, podStatus, killPodOptions.PodTerminationGracePeriodSecondsOverride); err != nil {
		kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToKillPod, "error killing pod: %v", err)
		// there was an error killing the pod, so we return that error directly
		utilruntime.HandleError(err)
		return err
	}
	return nil
}

2.2. SyncPodCreate

If the pod is being created, the pod worker start latency is recorded. Latency is measured relative to the first time the pod was seen by the API server.

// Latency measurements for the main workflow are relative to the
// first time the pod was seen by the API server.
var firstSeenTime time.Time
if firstSeenTimeStr, ok := pod.Annotations[kubetypes.ConfigFirstSeenAnnotationKey]; ok {
	firstSeenTime = kubetypes.ConvertToTimestamp(firstSeenTimeStr).Get()
}

// Record pod worker start latency if being created
// TODO: make pod workers record their own latencies
if updateType == kubetypes.SyncPodCreate {
	if !firstSeenTime.IsZero() {
		// This is the first time we are syncing the pod. Record the latency
		// since kubelet first saw the pod if firstSeenTime is set.
		metrics.PodWorkerStartLatency.Observe(metrics.SinceInMicroseconds(firstSeenTime))
	} else {
		glog.V(3).Infof("First seen time not recorded for pod %q", pod.UID)
	}
}

The final API pod status is generated from the pod and its runtime status, and the pod IP is set.

// Generate final API pod status with pod and status manager status
apiPodStatus := kl.generateAPIPodStatus(pod, podStatus)
// The pod IP may be changed in generateAPIPodStatus if the pod is using host network. (See #24576)
// TODO(random-liu): After writing pod spec into container labels, check whether pod is using host network, and
// set pod IP to hostIP directly in runtime.GetPodStatus
podStatus.IP = apiPodStatus.PodIP

The time the pod takes to reach the Running state is recorded.

// Record the time it takes for the pod to become running.
existingStatus, ok := kl.statusManager.GetPodStatus(pod.UID)
if !ok || existingStatus.Phase == v1.PodPending && apiPodStatus.Phase == v1.PodRunning &&
	!firstSeenTime.IsZero() {
	metrics.PodStartLatency.Observe(metrics.SinceInMicroseconds(firstSeenTime))
}
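
The condition above mixes || and && without parentheses. In Go, && binds more tightly than ||, so the expression groups as !ok || (pending && running && firstSeenSet). The small sketch below makes that grouping explicit. The function and parameter names are hypothetical.

```go
package main

import "fmt"

// asWritten mirrors the shape of the condition without parentheses.
func asWritten(ok, wasPending, nowRunning, firstSeenSet bool) bool {
	return !ok || wasPending && nowRunning && firstSeenSet
}

// explicit spells out the grouping that Go's precedence rules produce.
func explicit(ok, wasPending, nowRunning, firstSeenSet bool) bool {
	return !ok || (wasPending && nowRunning && firstSeenSet)
}

func main() {
	// No existing status at all: the condition is true even when the
	// first-seen time is not set.
	fmt.Println(asWritten(false, false, false, false))
	// Existing status, but the pod has not moved from Pending to Running.
	fmt.Println(asWritten(true, true, false, true))
	// Both forms agree for every input.
	same := true
	for i := 0; i < 16; i++ {
		a, b, c, d := i&1 != 0, i&2 != 0, i&4 != 0, i&8 != 0
		same = same && asWritten(a, b, c, d) == explicit(a, b, c, d)
	}
	fmt.Println(same)
}
```

One consequence worth noticing is that the firstSeenTime check only guards the Pending-to-Running branch, not the branch where no status exists yet.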

If the pod is not runnable, the pod and container statuses are updated with the reason.

runnable := kl.canRunPod(pod)
if !runnable.Admit {
	// Pod is not runnable; update the Pod and Container statuses to why.
	apiPodStatus.Reason = runnable.Reason
	apiPodStatus.Message = runnable.Message
	// Waiting containers are not creating.
	const waitingReason = "Blocked"
	for _, cs := range apiPodStatus.InitContainerStatuses {
		if cs.State.Waiting != nil {
			cs.State.Waiting.Reason = waitingReason
		}
	}
	for _, cs := range apiPodStatus.ContainerStatuses {
		if cs.State.Waiting != nil {
			cs.State.Waiting.Reason = waitingReason
		}
	}
}

The status is then updated in the status manager, and a pod that should not be running is killed.

// Update status in the status manager
kl.statusManager.SetPodStatus(pod, apiPodStatus)

// Kill pod if it should not be running
if !runnable.Admit || pod.DeletionTimestamp != nil || apiPodStatus.Phase == v1.PodFailed {
	var syncErr error
	if err := kl.killPod(pod, nil, podStatus, nil); err != nil {
		kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToKillPod, "error killing pod: %v", err)
		syncErr = fmt.Errorf("error killing pod: %v", err)
		utilruntime.HandleError(syncErr)
	} else {
		if !runnable.Admit {
			// There was no error killing the pod, but the pod cannot be run.
			// Return an error to signal that the sync loop should back off.
			syncErr = fmt.Errorf("pod cannot be run: %s", runnable.Message)
		}
	}
	return syncErr
}

If the network plugin is not ready yet, the pod is started only if it uses the host network.

// If the network plugin is not ready, only start the pod if it uses the host network
if rs := kl.runtimeState.networkErrors(); len(rs) != 0 && !kubecontainer.IsHostNetworkPod(pod) {
	kl.recorder.Eventf(pod, v1.EventTypeWarning, events.NetworkNotReady, "%s: %v", NetworkNotReadyErrorMsg, rs)
	return fmt.Errorf("%s: %v", NetworkNotReadyErrorMsg, rs)
}

2.3. Cgroups

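Cgroups are created for the pod, and resource parameters are applied to them if the cgroups-per-qos flag is enabled. For a terminated pod there is no need to create or update the pod's cgroups.

When the kubelet is restarted with the cgroups-per-qos flag enabled, all running containers of the pod should be killed intermittently and brought back up under the QoS cgroup hierarchy.

The containers in the pod are not killed if the pod's cgroups already exist or the pod is running for the first time.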

// Create Cgroups for the pod and apply resource parameters
// to them if cgroups-per-qos flag is enabled.
pcm := kl.containerManager.NewPodContainerManager()
// If pod has already been terminated then we need not create
// or update the pod's cgroup
if !kl.podIsTerminated(pod) {
	// When the kubelet is restarted with the cgroups-per-qos
	// flag enabled, all the pod's running containers
	// should be killed intermittently and brought back up
	// under the qos cgroup hierarchy.
	// Check if this is the pod's first sync
	firstSync := true
	for _, containerStatus := range apiPodStatus.ContainerStatuses {
		if containerStatus.State.Running != nil {
			firstSync = false
			break
		}
	}
	// Don't kill containers in pod if pod's cgroups already
	// exists or the pod is running for the first time
	podKilled := false
	if !pcm.Exists(pod) && !firstSync {
		if err := kl.killPod(pod, nil, podStatus, nil); err == nil {
			podKilled = true
		}
	}
	...

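If the pod was killed above and its restart policy is Never, its cgroups are neither created nor updated. Otherwise the pod's cgroups are created and updated.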

// Create and Update pod's Cgroups
// Don't create cgroups for run once pod if it was killed above
// The current policy is not to restart the run once pods when
// the kubelet is restarted with the new flag as run once pods are
// expected to run only once and if the kubelet is restarted then
// they are not expected to run again.
// We don't create and apply updates to cgroup if its a run once pod and was killed above
if !(podKilled && pod.Spec.RestartPolicy == v1.RestartPolicyNever) {
	if !pcm.Exists(pod) {
		if err := kl.containerManager.UpdateQOSCgroups(); err != nil {
			glog.V(2).Infof("Failed to update QoS cgroups while syncing pod: %v", err)
		}
		if err := pcm.EnsureExists(pod); err != nil {
			kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToCreatePodContainer, "unable to ensure pod container exists: %v", err)
			return fmt.Errorf("failed to ensure that the pod: %v cgroups exist and are correctly applied: %v", pod.UID, err)
		}
	}
}

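Here the QoS-level cgroups are updated through the containerManager's UpdateQOSCgroups, after which pcm.EnsureExists makes sure the pod's own cgroups exist.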

if err := kl.containerManager.UpdateQOSCgroups(); err != nil {
	glog.V(2).Infof("Failed to update QoS cgroups while syncing pod: %v", err)
}
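
The branching in this section can be condensed into one small decision function. This is a sketch under stated assumptions: the name planCgroups and its boolean parameters are hypothetical, the pod is assumed not to be terminated, and the kill is assumed to succeed whenever it is attempted.

```go
package main

import "fmt"

// planCgroups reports whether the pod's containers would be killed and
// whether the pod cgroup would be (re)created for a non-terminated pod.
// cgroupExists: the pod cgroup is already present.
// anyRunning:   at least one container is reported as running.
// restartNever: the pod's restart policy is Never.
func planCgroups(cgroupExists, anyRunning, restartNever bool) (killPod, ensureCgroup bool) {
	firstSync := !anyRunning
	// Containers are only killed when they run outside an existing pod cgroup.
	killPod = !cgroupExists && !firstSync
	// A killed run-once pod is not brought back, so no cgroup is created for it.
	ensureCgroup = !cgroupExists && !(killPod && restartNever)
	return killPod, ensureCgroup
}

func main() {
	fmt.Println(planCgroups(true, true, false))   // cgroup exists: nothing to do
	fmt.Println(planCgroups(false, false, false)) // first sync: create only
	fmt.Println(planCgroups(false, true, false))  // kill, then recreate
	fmt.Println(planCgroups(false, true, true))   // run-once pod: kill only
}
```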

2.4. Mirror Pod

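If the pod is a static pod with no mirror pod, a mirror pod is created. If a mirror pod exists but is being deleted or no longer matches the static pod, it is deleted and then recreated.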

// Create Mirror Pod for Static Pod if it doesn't already exist
if kubepod.IsStaticPod(pod) {
	podFullName := kubecontainer.GetPodFullName(pod)
	deleted := false
	if mirrorPod != nil {
		if mirrorPod.DeletionTimestamp != nil || !kl.podManager.IsMirrorPodOf(mirrorPod, pod) {
			// The mirror pod is semantically different from the static pod. Remove
			// it. The mirror pod will get recreated later.
			glog.Warningf("Deleting mirror pod %q because it is outdated", format.Pod(mirrorPod))
			if err := kl.podManager.DeleteMirrorPod(podFullName); err != nil {
				glog.Errorf("Failed deleting mirror pod %q: %v", format.Pod(mirrorPod), err)
			} else {
				deleted = true
			}
		}
	}
	if mirrorPod == nil || deleted {
		node, err := kl.GetNode()
		if err != nil || node.DeletionTimestamp != nil {
			glog.V(4).Infof("No need to create a mirror pod, since node %q has been removed from the cluster", kl.nodeName)
		} else {
			glog.V(4).Infof("Creating a mirror pod for static pod %q", format.Pod(pod))
			if err := kl.podManager.CreateMirrorPod(pod); err != nil {
				glog.Errorf("Failed creating a mirror pod for %q: %v", format.Pod(pod), err)
			}
		}
	}
}
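
The mirror pod handling boils down to two decisions: whether to delete an outdated mirror pod and whether to create a new one. The sketch below captures them in a pure function. The mirrorState type and the mirrorActions function are hypothetical, and the deletion is assumed to succeed whenever it is attempted.

```go
package main

import "fmt"

// mirrorState is a hypothetical summary of what is known about the mirror pod.
type mirrorState struct {
	exists        bool // a mirror pod was found
	deleting      bool // its DeletionTimestamp is set
	matchesStatic bool // it is still the mirror of the static pod
}

// mirrorActions decides what to do for a static pod.
// nodeGone means the node could not be fetched or is being deleted.
func mirrorActions(m mirrorState, nodeGone bool) (deleteOld, createNew bool) {
	if m.exists && (m.deleting || !m.matchesStatic) {
		deleteOld = true
	}
	if (!m.exists || deleteOld) && !nodeGone {
		createNew = true
	}
	return deleteOld, createNew
}

func main() {
	fmt.Println(mirrorActions(mirrorState{exists: true, matchesStatic: true}, false)) // up to date
	fmt.Println(mirrorActions(mirrorState{}, false))                                  // missing
	fmt.Println(mirrorActions(mirrorState{exists: true}, false))                      // outdated
	fmt.Println(mirrorActions(mirrorState{}, true))                                   // node removed
}
```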

2.5. makePodDataDirs

Create the data directories for the pod.

// Make data directories for the pod
if err := kl.makePodDataDirs(pod); err != nil {
	kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedToMakePodDataDirectories, "error making pod data directories: %v", err)
	glog.Errorf("Unable to make pod data directories for pod %q: %v", format.Pod(pod), err)
	return err
}

The data directories include:

  • PodDir: {kubelet.rootDirectory}/pods/podUID
  • PodVolumesDir: {PodDir}/volumes
  • PodPluginsDir: {PodDir}/plugins

// makePodDataDirs creates the dirs for the pod datas.
func (kl *Kubelet) makePodDataDirs(pod *v1.Pod) error {
	uid := pod.UID
	if err := os.MkdirAll(kl.getPodDir(uid), 0750); err != nil && !os.IsExist(err) {
		return err
	}
	if err := os.MkdirAll(kl.getPodVolumesDir(uid), 0750); err != nil && !os.IsExist(err) {
		return err
	}
	if err := os.MkdirAll(kl.getPodPluginsDir(uid), 0750); err != nil && !os.IsExist(err) {
		return err
	}
	return nil
}

2.6. mount volumes

Volumes are mounted only for pods that are not terminated. syncPod waits until they are attached and mounted.

// Volume manager will not mount volumes for terminated pods
if !kl.podIsTerminated(pod) {
	// Wait for volumes to attach/mount
	if err := kl.volumeManager.WaitForAttachAndMount(pod); err != nil {
		kl.recorder.Eventf(pod, v1.EventTypeWarning, events.FailedMountVolume, "Unable to mount volumes for pod %q: %v", format.Pod(pod), err)
		glog.Errorf("Unable to mount volumes for pod %q: %v; skipping pod", format.Pod(pod), err)
		return err
	}
}

2.7. PullSecretsForPod

Fetch the image pull secrets for the pod.

// Fetch the pull secrets for the pod
pullSecrets := kl.getPullSecretsForPod(pod)

getPullSecretsForPod is implemented as follows:

// getPullSecretsForPod inspects the Pod and retrieves the referenced pull
// secrets.
func (kl *Kubelet) getPullSecretsForPod(pod *v1.Pod) []v1.Secret {
	pullSecrets := []v1.Secret{}

	for _, secretRef := range pod.Spec.ImagePullSecrets {
		secret, err := kl.secretManager.GetSecret(pod.Namespace, secretRef.Name)
		if err != nil {
			glog.Warningf("Unable to retrieve pull secret %s/%s for %s/%s due to %v.  The image pull may not succeed.", pod.Namespace, secretRef.Name, pod.Namespace, pod.Name, err)
			continue
		}

		pullSecrets = append(pullSecrets, *secret)
	}

	return pullSecrets
}

2.8. containerRuntime.SyncPod

The container runtime's SyncPod function is called to perform the pod operations. From here, the logic of kubelet.syncPod moves into containerRuntime.SyncPod.

// Call the container runtime's SyncPod callback
result := kl.containerRuntime.SyncPod(pod, apiPodStatus, podStatus, pullSecrets, kl.backOff)
kl.reasonCache.Update(pod.UID, result)
if err := result.Error(); err != nil {
	// Do not return error if the only failures were pods in backoff
	for _, r := range result.SyncResults {
		if r.Error != kubecontainer.ErrCrashLoopBackOff && r.Error != images.ErrImagePullBackOff {
			// Do not record an event here, as we keep all event logging for sync pod failures
			// local to container runtime so we get better errors
			return err
		}
	}

	return nil
}

3. Runtime.SyncPod

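SyncPod syncs the running pod into the desired pod. It mainly performs the following steps:

  • Compute the sandbox and container changes.
  • Kill the pod if necessary.
  • Kill any running containers that should not be kept.
  • Create a sandbox if necessary.
  • Start the init containers.
  • Start the regular containers.

The Runtime.SyncPod code is located in pkg/kubelet/kuberuntime/kuberuntime_manager.go.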

3.1. computePodActions

Compute the sandbox and container changes.

// Step 1: Compute sandbox and container changes.
podContainerChanges := m.computePodActions(pod, podStatus)
glog.V(3).Infof("computePodActions got %+v for pod %q", podContainerChanges, format.Pod(pod))
if podContainerChanges.CreateSandbox {
	ref, err := ref.GetReference(legacyscheme.Scheme, pod)
	if err != nil {
		glog.Errorf("Couldn't make a ref to pod %q: '%v'", format.Pod(pod), err)
	}
	if podContainerChanges.SandboxID != "" {
		m.recorder.Eventf(ref, v1.EventTypeNormal, events.SandboxChanged, "Pod sandbox changed, it will be killed and re-created.")
	} else {
		glog.V(4).Infof("SyncPod received new pod %q, will create a sandbox for it", format.Pod(pod))
	}
}

3.2. killPodWithSyncResult

Kill the pod if necessary, for example when the sandbox has changed.

// Step 2: Kill the pod if the sandbox has changed.
if podContainerChanges.KillPod {
	if !podContainerChanges.CreateSandbox {
		glog.V(4).Infof("Stopping PodSandbox for %q because all other containers are dead.", format.Pod(pod))
	} else {
		glog.V(4).Infof("Stopping PodSandbox for %q, will start new one", format.Pod(pod))
	}

	killResult := m.killPodWithSyncResult(pod, kubecontainer.ConvertPodStatusToRunningPod(m.runtimeName, podStatus), nil)
	result.AddPodSyncResult(killResult)
	if killResult.Error() != nil {
		glog.Errorf("killPodWithSyncResult failed: %v", killResult.Error())
		return
	}

	if podContainerChanges.CreateSandbox {
		m.purgeInitContainers(pod, podStatus)
	}
}

3.3. killContainer

Kill any running containers in the pod that should not be kept.

// Step 3: kill any running containers in this pod which are not to keep.
for containerID, containerInfo := range podContainerChanges.ContainersToKill {
	glog.V(3).Infof("Killing unwanted container %q(id=%q) for pod %q", containerInfo.name, containerID, format.Pod(pod))
	killContainerResult := kubecontainer.NewSyncResult(kubecontainer.KillContainer, containerInfo.name)
	result.AddSyncResult(killContainerResult)
	if err := m.killContainer(pod, containerID, containerInfo.name, containerInfo.message, nil); err != nil {
		killContainerResult.Fail(kubecontainer.ErrKillContainer, err.Error())
		glog.Errorf("killContainer %q(id=%q) for pod %q failed: %v", containerInfo.name, containerID, format.Pod(pod), err)
		return
	}
}

3.4. createPodSandbox

Create a sandbox for the pod if necessary.

// Step 4: Create a sandbox for the pod if necessary.
...
glog.V(4).Infof("Creating sandbox for pod %q", format.Pod(pod))
createSandboxResult := kubecontainer.NewSyncResult(kubecontainer.CreatePodSandbox, format.Pod(pod))
result.AddSyncResult(createSandboxResult)
podSandboxID, msg, err = m.createPodSandbox(pod, podContainerChanges.Attempt)
if err != nil {
	createSandboxResult.Fail(kubecontainer.ErrCreatePodSandbox, msg)
	glog.Errorf("createPodSandbox for pod %q failed: %v", format.Pod(pod), err)
	ref, referr := ref.GetReference(legacyscheme.Scheme, pod)
	if referr != nil {
		glog.Errorf("Couldn't make a ref to pod %q: '%v'", format.Pod(pod), referr)
	}
	m.recorder.Eventf(ref, v1.EventTypeWarning, events.FailedCreatePodSandBox, "Failed create pod sandbox: %v", err)
	return
}
glog.V(4).Infof("Created PodSandbox %q for pod %q", podSandboxID, format.Pod(pod))

3.5. start init container

Start the next init container.

// Step 5: start the init container.
if container := podContainerChanges.NextInitContainerToStart; container != nil {
	// Start the next init container.
	startContainerResult := kubecontainer.NewSyncResult(kubecontainer.StartContainer, container.Name)
	result.AddSyncResult(startContainerResult)
	isInBackOff, msg, err := m.doBackOff(pod, container, podStatus, backOff)
	if isInBackOff {
		startContainerResult.Fail(err, msg)
		glog.V(4).Infof("Backing Off restarting init container %+v in pod %v", container, format.Pod(pod))
		return
	}

	glog.V(4).Infof("Creating init container %+v in pod %v", container, format.Pod(pod))
	if msg, err := m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, kubecontainer.ContainerTypeInit); err != nil {
		startContainerResult.Fail(err, msg)
		utilruntime.HandleError(fmt.Errorf("init container start failed: %v: %s", err, msg))
		return
	}

	// Successfully started the container; clear the entry in the failure
	glog.V(4).Infof("Completed init container %q for pod %q", container.Name, format.Pod(pod))
}

3.6. start containers

Start the regular containers.

// Step 6: start containers in podContainerChanges.ContainersToStart.
for _, idx := range podContainerChanges.ContainersToStart {
	container := &pod.Spec.Containers[idx]
	startContainerResult := kubecontainer.NewSyncResult(kubecontainer.StartContainer, container.Name)
	result.AddSyncResult(startContainerResult)

	isInBackOff, msg, err := m.doBackOff(pod, container, podStatus, backOff)
	if isInBackOff {
		startContainerResult.Fail(err, msg)
		glog.V(4).Infof("Backing Off restarting container %+v in pod %v", container, format.Pod(pod))
		continue
	}

	glog.V(4).Infof("Creating container %+v in pod %v", container, format.Pod(pod))
	// Run the container via startContainer
	if msg, err := m.startContainer(podSandboxID, podSandboxConfig, container, pod, podStatus, pullSecrets, podIP, kubecontainer.ContainerTypeRegular); err != nil {
		startContainerResult.Fail(err, msg)
		// known errors that are logged in other places are logged at higher levels here to avoid
		// repetitive log spam
		switch {
		case err == images.ErrImagePullBackOff:
			glog.V(3).Infof("container start failed: %v: %s", err, msg)
		default:
			utilruntime.HandleError(fmt.Errorf("container start failed: %v: %s", err, msg))
		}
		continue
	}
}

4. startContainer

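startContainer starts a container and, on failure, returns a message describing why it failed together with the error.

It consists of the following steps:

  1. Pull the image.
  2. Create the container.
  3. Start the container.
  4. Run the post start lifecycle hooks (if configured).

The complete code of startContainer is as follows:

The startContainer code is located in pkg/kubelet/kuberuntime/kuberuntime_container.go.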

// startContainer starts a container and returns a message indicates why it is failed on error.
// It starts the container through the following steps:
// * pull the image
// * create the container
// * start the container
// * run the post start lifecycle hooks (if applicable)
func (m *kubeGenericRuntimeManager) startContainer(podSandboxID string, podSandboxConfig *runtimeapi.PodSandboxConfig, container *v1.Container, pod *v1.Pod, podStatus *kubecontainer.PodStatus, pullSecrets []v1.Secret, podIP string, containerType kubecontainer.ContainerType) (string, error) {
	// Step 1: pull the image.
	imageRef, msg, err := m.imagePuller.EnsureImageExists(pod, container, pullSecrets)
	if err != nil {
		m.recordContainerEvent(pod, container, "", v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
		return msg, err
	}

	// Step 2: create the container.
	ref, err := kubecontainer.GenerateContainerRef(pod, container)
	if err != nil {
		glog.Errorf("Can't make a ref to pod %q, container %v: %v", format.Pod(pod), container.Name, err)
	}
	glog.V(4).Infof("Generating ref for container %s: %#v", container.Name, ref)

	// For a new container, the RestartCount should be 0
	restartCount := 0
	containerStatus := podStatus.FindContainerStatusByName(container.Name)
	if containerStatus != nil {
		restartCount = containerStatus.RestartCount + 1
	}

	containerConfig, cleanupAction, err := m.generateContainerConfig(container, pod, restartCount, podIP, imageRef, containerType)
	if cleanupAction != nil {
		defer cleanupAction()
	}
	if err != nil {
		m.recordContainerEvent(pod, container, "", v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
		return grpc.ErrorDesc(err), ErrCreateContainerConfig
	}

	containerID, err := m.runtimeService.CreateContainer(podSandboxID, containerConfig, podSandboxConfig)
	if err != nil {
		m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
		return grpc.ErrorDesc(err), ErrCreateContainer
	}
	err = m.internalLifecycle.PreStartContainer(pod, container, containerID)
	if err != nil {
		m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToStartContainer, "Internal PreStartContainer hook failed: %v", grpc.ErrorDesc(err))
		return grpc.ErrorDesc(err), ErrPreStartHook
	}
	m.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.CreatedContainer, "Created container")

	if ref != nil {
		m.containerRefManager.SetRef(kubecontainer.ContainerID{
			Type: m.runtimeName,
			ID:   containerID,
		}, ref)
	}

	// Step 3: start the container.
	err = m.runtimeService.StartContainer(containerID)
	if err != nil {
		m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToStartContainer, "Error: %v", grpc.ErrorDesc(err))
		return grpc.ErrorDesc(err), kubecontainer.ErrRunContainer
	}
	m.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.StartedContainer, "Started container")

	// Symlink container logs to the legacy container log location for cluster logging
	// support.
	// TODO(random-liu): Remove this after cluster logging supports CRI container log path.
	containerMeta := containerConfig.GetMetadata()
	sandboxMeta := podSandboxConfig.GetMetadata()
	legacySymlink := legacyLogSymlink(containerID, containerMeta.Name, sandboxMeta.Name,
		sandboxMeta.Namespace)
	containerLog := filepath.Join(podSandboxConfig.LogDirectory, containerConfig.LogPath)
	// only create legacy symlink if containerLog path exists (or the error is not IsNotExist).
	// Because if containerLog path does not exist, only dandling legacySymlink is created.
	// This dangling legacySymlink is later removed by container gc, so it does not make sense
	// to create it in the first place. it happens when journald logging driver is used with docker.
	if _, err := m.osInterface.Stat(containerLog); !os.IsNotExist(err) {
		if err := m.osInterface.Symlink(containerLog, legacySymlink); err != nil {
			glog.Errorf("Failed to create legacy symbolic link %q to container %q log %q: %v",
				legacySymlink, containerID, containerLog, err)
		}
	}

	// Step 4: execute the post start hook.
	if container.Lifecycle != nil && container.Lifecycle.PostStart != nil {
		kubeContainerID := kubecontainer.ContainerID{
			Type: m.runtimeName,
			ID:   containerID,
		}
		msg, handlerErr := m.runner.Run(kubeContainerID, pod, container, container.Lifecycle.PostStart)
		if handlerErr != nil {
			m.recordContainerEvent(pod, container, kubeContainerID.ID, v1.EventTypeWarning, events.FailedPostStartHook, msg)
			if err := m.killContainer(pod, kubeContainerID, container.Name, "FailedPostStartHook", nil); err != nil {
				glog.Errorf("Failed to kill container %q(id=%q) in pod %q: %v, %v",
					container.Name, kubeContainerID.String(), format.Pod(pod), ErrPostStartHook, err)
			}
			return msg, fmt.Errorf("%s: %v", ErrPostStartHook, handlerErr)
		}
	}

	return "", nil
}

startContainer is analyzed piece by piece below:

4.1. pull image

The EnsureImageExists method pulls the image for the specified container of the pod and returns the image reference, a message, and an error.

// Step 1: pull the image.
imageRef, msg, err := m.imagePuller.EnsureImageExists(pod, container, pullSecrets)
if err != nil {
	m.recordContainerEvent(pod, container, "", v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
	return msg, err
}

4.2. CreateContainer

First, a *v1.ObjectReference is generated for the container. This object holds the container's reference information.

// Step 2: create the container.
ref, err := kubecontainer.GenerateContainerRef(pod, container)
if err != nil {
	glog.Errorf("Can't make a ref to pod %q, container %v: %v", format.Pod(pod), container.Name, err)
}
glog.V(4).Infof("Generating ref for container %s: %#v", container.Name, ref)

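The restart count is computed next. A new container starts at 0. If a previous status exists for the container, the count is that status's RestartCount plus one.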

// For a new container, the RestartCount should be 0
restartCount := 0
containerStatus := podStatus.FindContainerStatusByName(container.Name)
if containerStatus != nil {
	restartCount = containerStatus.RestartCount + 1
}
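
The same rule, written as a standalone function for illustration. The containerStatus type here is a hypothetical reduction of the runtime's container status to the one field that matters.

```go
package main

import "fmt"

// containerStatus is a hypothetical, reduced container status.
type containerStatus struct {
	RestartCount int
}

// nextRestartCount returns 0 for a container with no previous status,
// otherwise the previous count plus one.
func nextRestartCount(prev *containerStatus) int {
	if prev == nil {
		return 0
	}
	return prev.RestartCount + 1
}

func main() {
	fmt.Println(nextRestartCount(nil))                               // brand-new container
	fmt.Println(nextRestartCount(&containerStatus{RestartCount: 0})) // first restart
	fmt.Println(nextRestartCount(&containerStatus{RestartCount: 4}))
}
```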

Generate the container configuration.

containerConfig, cleanupAction, err := m.generateContainerConfig(container, pod, restartCount, podIP, imageRef, containerType)
if cleanupAction != nil {
	defer cleanupAction()
}
if err != nil {
	m.recordContainerEvent(pod, container, "", v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
	return grpc.ErrorDesc(err), ErrCreateContainerConfig
}

Call runtimeService to perform the CreateContainer operation.

containerID, err := m.runtimeService.CreateContainer(podSandboxID, containerConfig, podSandboxConfig)
if err != nil {
	m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToCreateContainer, "Error: %v", grpc.ErrorDesc(err))
	return grpc.ErrorDesc(err), ErrCreateContainer
}
err = m.internalLifecycle.PreStartContainer(pod, container, containerID)
if err != nil {
	m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToStartContainer, "Internal PreStartContainer hook failed: %v", grpc.ErrorDesc(err))
	return grpc.ErrorDesc(err), ErrPreStartHook
}
m.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.CreatedContainer, "Created container")

if ref != nil {
	m.containerRefManager.SetRef(kubecontainer.ContainerID{
		Type: m.runtimeName,
		ID:   containerID,
	}, ref)
}

4.3. StartContainer

Call the StartContainer method of runtimeService to start the container.

// Step 3: start the container.
err = m.runtimeService.StartContainer(containerID)
if err != nil {
	m.recordContainerEvent(pod, container, containerID, v1.EventTypeWarning, events.FailedToStartContainer, "Error: %v", grpc.ErrorDesc(err))
	return grpc.ErrorDesc(err), kubecontainer.ErrRunContainer
}
m.recordContainerEvent(pod, container, containerID, v1.EventTypeNormal, events.StartedContainer, "Started container")

// Symlink container logs to the legacy container log location for cluster logging
// support.
// TODO(random-liu): Remove this after cluster logging supports CRI container log path.
containerMeta := containerConfig.GetMetadata()
sandboxMeta := podSandboxConfig.GetMetadata()
legacySymlink := legacyLogSymlink(containerID, containerMeta.Name, sandboxMeta.Name,
	sandboxMeta.Namespace)
containerLog := filepath.Join(podSandboxConfig.LogDirectory, containerConfig.LogPath)
// only create legacy symlink if containerLog path exists (or the error is not IsNotExist).
// Because if containerLog path does not exist, only dangling legacySymlink is created.
// This dangling legacySymlink is later removed by container gc, so it does not make sense
// to create it in the first place. it happens when journald logging driver is used with docker.
if _, err := m.osInterface.Stat(containerLog); !os.IsNotExist(err) {
	if err := m.osInterface.Symlink(containerLog, legacySymlink); err != nil {
		glog.Errorf("Failed to create legacy symbolic link %q to container %q log %q: %v",
			legacySymlink, containerID, containerLog, err)
	}
}

4.4. execute post start hook

If Lifecycle.PostStart is specified, the PostStart hook is executed. If the hook fails, the container is killed and then restarted according to the restart policy.

// Step 4: execute the post start hook.
if container.Lifecycle != nil && container.Lifecycle.PostStart != nil {
	kubeContainerID := kubecontainer.ContainerID{
		Type: m.runtimeName,
		ID:   containerID,
	}
	msg, handlerErr := m.runner.Run(kubeContainerID, pod, container, container.Lifecycle.PostStart)
	if handlerErr != nil {
		m.recordContainerEvent(pod, container, kubeContainerID.ID, v1.EventTypeWarning, events.FailedPostStartHook, msg)
		if err := m.killContainer(pod, kubeContainerID, container.Name, "FailedPostStartHook", nil); err != nil {
			glog.Errorf("Failed to kill container %q(id=%q) in pod %q: %v, %v",
				container.Name, kubeContainerID.String(), format.Pod(pod), ErrPostStartHook, err)
		}
		return msg, fmt.Errorf("%s: %v", ErrPostStartHook, handlerErr)
	}
}

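5. Summary

The kubelet manages the lifecycle of pods on a Node (create, delete, update, and query). It works through several kinds of managers, each running its own task asynchronously, and uses a number of channels to propagate state changes. One of the more important channels is podUpdates <-chan UpdatePodOptions, which carries changes to a pod.

Call chain for creating a pod: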

syncLoopIteration-->kubetypes.ADD-->HandlePodAdditions(u.Pods)-->dispatchWork(pod, kubetypes.SyncPodCreate, mirrorPod, start)-->podWorkers.UpdatePod-->managePodLoop(podUpdates)-->syncPod(o syncPodOptions)-->containerRuntime.SyncPod-->startContainer
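
The channel-per-pod idea behind podWorkers.UpdatePod and managePodLoop can be sketched as follows. This is a simplified illustration, not the kubelet implementation: podUpdate, the buffered channel size, and the Stop method are assumptions made so the example is self-contained.

```go
package main

import (
	"fmt"
	"sync"
)

// podUpdate is a hypothetical stand-in for UpdatePodOptions.
type podUpdate struct {
	PodUID string
	Kind   string // for example "create" or "update"
}

// podWorkers keeps one goroutine and one channel per pod, so updates for the
// same pod are handled in order while different pods sync concurrently.
type podWorkers struct {
	mu      sync.Mutex
	updates map[string]chan podUpdate
	wg      sync.WaitGroup
	syncPod func(podUpdate)
}

func newPodWorkers(syncPod func(podUpdate)) *podWorkers {
	return &podWorkers{updates: make(map[string]chan podUpdate), syncPod: syncPod}
}

// UpdatePod sends the update to the pod's channel and starts the worker
// goroutine the first time a pod is seen.
func (w *podWorkers) UpdatePod(u podUpdate) {
	w.mu.Lock()
	ch, ok := w.updates[u.PodUID]
	if !ok {
		ch = make(chan podUpdate, 8)
		w.updates[u.PodUID] = ch
		w.wg.Add(1)
		go w.managePodLoop(ch)
	}
	w.mu.Unlock()
	ch <- u
}

// managePodLoop handles the updates of a single pod one at a time.
func (w *podWorkers) managePodLoop(ch <-chan podUpdate) {
	defer w.wg.Done()
	for u := range ch {
		w.syncPod(u)
	}
}

// Stop closes every channel and waits for the workers to drain.
func (w *podWorkers) Stop() {
	w.mu.Lock()
	for _, ch := range w.updates {
		close(ch)
	}
	w.mu.Unlock()
	w.wg.Wait()
}

// runDemo records which updates each pod received, in order.
func runDemo() map[string][]string {
	var mu sync.Mutex
	seen := make(map[string][]string)
	w := newPodWorkers(func(u podUpdate) {
		mu.Lock()
		defer mu.Unlock()
		seen[u.PodUID] = append(seen[u.PodUID], u.Kind)
	})
	w.UpdatePod(podUpdate{PodUID: "pod-a", Kind: "create"})
	w.UpdatePod(podUpdate{PodUID: "pod-b", Kind: "create"})
	w.UpdatePod(podUpdate{PodUID: "pod-a", Kind: "update"})
	w.Stop()
	return seen
}

func main() {
	fmt.Println(runDemo())
}
```

Because each pod has its own channel and goroutine, a slow sync for one pod does not block the others, while updates to the same pod keep their order.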


12 - Runtime

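12.1 - Overview of runc and containerd

This article describes OCI, CRI, runc, containerd, cri-containerd, dockershim, and related components, and how they call one another.

1. Overview

The call relationships among the components are shown below:

[Figure: component call relationships]

Image source: https://www.jianshu.com/p/62e71584d1cb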

2. OCI(Open Container Initiative)

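OCI (Open Container Initiative) defines open standards and specifications for container runtimes and images, including:

  • runtime-spec: container lifecycle management; see runtime-spec
  • image-spec: image lifecycle management; see image-spec

Container runtimes that implement the OCI standard include runc and kata.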
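
To make "container lifecycle management" concrete, here is a minimal in-memory sketch of the container states named in the runtime-spec (creating, created, running, stopped) and which operations are allowed in each. The Container type and its methods are hypothetical; no real container is created, and Kill assumes the process exits as soon as it is signalled.

```go
package main

import (
	"errors"
	"fmt"
)

// Status values are named after the states in the OCI runtime-spec.
type Status string

const (
	Creating Status = "creating"
	Created  Status = "created"
	Running  Status = "running"
	Stopped  Status = "stopped"
)

var errBadState = errors.New("operation not allowed in current state")

// Container is a hypothetical in-memory model used only for illustration.
type Container struct {
	ID     string
	Status Status
}

// Create models the create operation: the container ends up in "created".
func Create(id string) *Container {
	c := &Container{ID: id, Status: Creating}
	// A real runtime would set up namespaces and cgroups here.
	c.Status = Created
	return c
}

// Start is only valid for a created container.
func (c *Container) Start() error {
	if c.Status != Created {
		return errBadState
	}
	c.Status = Running
	return nil
}

// Kill is only valid for a created or running container. This sketch assumes
// the process exits immediately after the signal.
func (c *Container) Kill() error {
	if c.Status != Created && c.Status != Running {
		return errBadState
	}
	c.Status = Stopped
	return nil
}

// Delete is only valid for a stopped container.
func (c *Container) Delete() error {
	if c.Status != Stopped {
		return errBadState
	}
	return nil
}

func main() {
	c := Create("demo")
	fmt.Println(c.Status, c.Delete())
	fmt.Println(c.Start(), c.Status)
	fmt.Println(c.Kill(), c.Status)
	fmt.Println(c.Delete())
}
```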

3. RunC

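runc (run container) is a lightweight container runtime tool based on the OCI standard, used to create and run containers. containerd, in turn, maintains the running state of containers created through runc. In short, runc creates and runs containers, while containerd is a long-running daemon that manages them.

runc includes libcontainer, which wraps the calls to namespaces and cgroups.

Command usage: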

To start a new instance of a container:

    # runc run [ -b bundle ] <container-id>
    
USAGE:
   runc [global options] command [command options] [arguments...]

COMMANDS:
   checkpoint  checkpoint a running container
   create      create a container
   delete      delete any resources held by the container often used with detached container
   events      display container events such as OOM notifications, cpu, memory, and IO usage statistics
   exec        execute new process inside the container
   init        initialize the namespaces and launch the process (do not call it outside of runc)
   kill        kill sends the specified signal (default: SIGTERM) to the container's init process
   list        lists containers started by runc with the given root
   pause       pause suspends all processes inside the container
   ps          ps displays the processes running inside a container
   restore     restore a container from a previous checkpoint
   resume      resumes all processes that have been previously paused
   run         create and run a container
   spec        create a new specification file
   start       executes the user defined process in a created container
   state       output the state of a container
   update      update container resource constraints
   help, h     Shows a list of commands or help for one command    

4. Containerd

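containerd (container daemon) is a daemon process that manages and runs containers. It can pull and push images and manage container storage and networking. It calls runc to create and run containers.

4.1. containerd architecture diagram

4.2. Relationship between docker, containerd, and runc

A more detailed view of the call logic:

5. CRI (Container Runtime Interface)

CRI, the Container Runtime Interface, defines the API between k8s and the container runtime. The kubelet calls the container runtime through CRI, so any container runtime that implements the CRI interface can be plugged into the kubelet.

5.1. How docker and k8s call containerd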

5.2. cri-api

5.2.1. runtime service

// Runtime service defines the public APIs for remote container runtimes
service RuntimeService {
    // Version returns the runtime name, runtime version, and runtime API version.
    rpc Version(VersionRequest) returns (VersionResponse) {}

    // RunPodSandbox creates and starts a pod-level sandbox. Runtimes must ensure
    // the sandbox is in the ready state on success.
    rpc RunPodSandbox(RunPodSandboxRequest) returns (RunPodSandboxResponse) {}
    // StopPodSandbox stops any running process that is part of the sandbox and
    // reclaims network resources (e.g., IP addresses) allocated to the sandbox.
    // If there are any running containers in the sandbox, they must be forcibly
    // terminated.
    // This call is idempotent, and must not return an error if all relevant
    // resources have already been reclaimed. kubelet will call StopPodSandbox
    // at least once before calling RemovePodSandbox. It will also attempt to
    // reclaim resources eagerly, as soon as a sandbox is not needed. Hence,
    // multiple StopPodSandbox calls are expected.
    rpc StopPodSandbox(StopPodSandboxRequest) returns (StopPodSandboxResponse) {}
    // RemovePodSandbox removes the sandbox. If there are any running containers
    // in the sandbox, they must be forcibly terminated and removed.
    // This call is idempotent, and must not return an error if the sandbox has
    // already been removed.
    rpc RemovePodSandbox(RemovePodSandboxRequest) returns (RemovePodSandboxResponse) {}
    // PodSandboxStatus returns the status of the PodSandbox. If the PodSandbox is not
    // present, returns an error.
    rpc PodSandboxStatus(PodSandboxStatusRequest) returns (PodSandboxStatusResponse) {}
    // ListPodSandbox returns a list of PodSandboxes.
    rpc ListPodSandbox(ListPodSandboxRequest) returns (ListPodSandboxResponse) {}

    // CreateContainer creates a new container in specified PodSandbox
    rpc CreateContainer(CreateContainerRequest) returns (CreateContainerResponse) {}
    // StartContainer starts the container.
    rpc StartContainer(StartContainerRequest) returns (StartContainerResponse) {}
    // StopContainer stops a running container with a grace period (i.e., timeout).
    // This call is idempotent, and must not return an error if the container has
    // already been stopped.
    // The runtime must forcibly kill the container after the grace period is
    // reached.
    rpc StopContainer(StopContainerRequest) returns (StopContainerResponse) {}
    // RemoveContainer removes the container. If the container is running, the
    // container must be forcibly removed.
    // This call is idempotent, and must not return an error if the container has
    // already been removed.
    rpc RemoveContainer(RemoveContainerRequest) returns (RemoveContainerResponse) {}
    // ListContainers lists all containers by filters.
    rpc ListContainers(ListContainersRequest) returns (ListContainersResponse) {}
    // ContainerStatus returns status of the container. If the container is not
    // present, returns an error.
    rpc ContainerStatus(ContainerStatusRequest) returns (ContainerStatusResponse) {}
    // UpdateContainerResources updates ContainerConfig of the container.
    rpc UpdateContainerResources(UpdateContainerResourcesRequest) returns (UpdateContainerResourcesResponse) {}
    // ReopenContainerLog asks runtime to reopen the stdout/stderr log file
    // for the container. This is often called after the log file has been
    // rotated. If the container is not running, container runtime can choose
    // to either create a new log file and return nil, or return an error.
    // Once it returns error, new container log file MUST NOT be created.
    rpc ReopenContainerLog(ReopenContainerLogRequest) returns (ReopenContainerLogResponse) {}

    // ExecSync runs a command in a container synchronously.
    rpc ExecSync(ExecSyncRequest) returns (ExecSyncResponse) {}
    // Exec prepares a streaming endpoint to execute a command in the container.
    rpc Exec(ExecRequest) returns (ExecResponse) {}
    // Attach prepares a streaming endpoint to attach to a running container.
    rpc Attach(AttachRequest) returns (AttachResponse) {}
    // PortForward prepares a streaming endpoint to forward ports from a PodSandbox.
    rpc PortForward(PortForwardRequest) returns (PortForwardResponse) {}

    // ContainerStats returns stats of the container. If the container does not
    // exist, the call returns an error.
    rpc ContainerStats(ContainerStatsRequest) returns (ContainerStatsResponse) {}
    // ListContainerStats returns stats of all running containers.
    rpc ListContainerStats(ListContainerStatsRequest) returns (ListContainerStatsResponse) {}

    // UpdateRuntimeConfig updates the runtime configuration based on the given request.
    rpc UpdateRuntimeConfig(UpdateRuntimeConfigRequest) returns (UpdateRuntimeConfigResponse) {}

    // Status returns the status of the runtime.
    rpc Status(StatusRequest) returns (StatusResponse) {}
}

5.2.2. image service

// ImageService defines the public APIs for managing images.
service ImageService {
    // ListImages lists existing images.
    rpc ListImages(ListImagesRequest) returns (ListImagesResponse) {}
    // ImageStatus returns the status of the image. If the image is not
    // present, returns a response with ImageStatusResponse.Image set to
    // nil.
    rpc ImageStatus(ImageStatusRequest) returns (ImageStatusResponse) {}
    // PullImage pulls an image with authentication config.
    rpc PullImage(PullImageRequest) returns (PullImageResponse) {}
    // RemoveImage removes the image.
    // This call is idempotent, and must not return an error if the image has
    // already been removed.
    rpc RemoveImage(RemoveImageRequest) returns (RemoveImageResponse) {}
    // ImageFSInfo returns information of the filesystem that is used to store images.
    rpc ImageFsInfo(ImageFsInfoRequest) returns (ImageFsInfoResponse) {}
}

5.3. cri-containerd

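5.3.1. CRI plugin call flow

  1. The kubelet calls the CRI plugin through the CRI runtime service interface to create a pod.
  2. cri creates and configures the pod's network namespace through the CNI interface.
  3. cri calls containerd to create the sandbox container (pause container) and places it in the pod's cgroup and namespace.
  4. The kubelet calls the CRI plugin through the image service interface to pull the image; the plugin in turn pulls it through containerd.
  5. The kubelet calls the CRI plugin through the runtime service interface to run the pulled image; the plugin runs the application container through containerd and places it in the pod's cgroup and namespace.

For details, see: https://github.com/containerd/cri/blob/release/1.4/docs/architecture.md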
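
The order of calls in the flow above can be sketched with a fake runtime. The interfaces below are hypothetical, heavily reduced versions of the RuntimeService and ImageService listed in 5.2 (the real RPCs take request and response messages), and fakeCRI only records which call was made.

```go
package main

import "fmt"

// runtimeService and imageService are hypothetical, reduced versions of the
// CRI services; the real API uses request/response messages.
type runtimeService interface {
	RunPodSandbox(podName string) (string, error)
	CreateContainer(sandboxID, name, imageRef string) (string, error)
	StartContainer(containerID string) error
}

type imageService interface {
	PullImage(image string) (string, error)
}

// fakeCRI records the calls it receives instead of talking to containerd.
type fakeCRI struct{ calls []string }

func (f *fakeCRI) RunPodSandbox(podName string) (string, error) {
	// A real CRI plugin would configure the network namespace through CNI
	// and start the pause container here.
	f.calls = append(f.calls, "RunPodSandbox")
	return "sandbox-" + podName, nil
}

func (f *fakeCRI) PullImage(image string) (string, error) {
	f.calls = append(f.calls, "PullImage")
	return image, nil
}

func (f *fakeCRI) CreateContainer(sandboxID, name, imageRef string) (string, error) {
	f.calls = append(f.calls, "CreateContainer")
	return sandboxID + "/" + name, nil
}

func (f *fakeCRI) StartContainer(containerID string) error {
	f.calls = append(f.calls, "StartContainer")
	return nil
}

// startPod plays the role of the kubelet: sandbox first, then image, then
// the application container.
func startPod(rs runtimeService, is imageService, pod, name, image string) (string, error) {
	sandboxID, err := rs.RunPodSandbox(pod)
	if err != nil {
		return "", fmt.Errorf("run sandbox: %w", err)
	}
	ref, err := is.PullImage(image)
	if err != nil {
		return "", fmt.Errorf("pull image: %w", err)
	}
	id, err := rs.CreateContainer(sandboxID, name, ref)
	if err != nil {
		return "", fmt.Errorf("create container: %w", err)
	}
	if err := rs.StartContainer(id); err != nil {
		return "", fmt.Errorf("start container: %w", err)
	}
	return id, nil
}

func main() {
	cri := &fakeCRI{}
	id, err := startPod(cri, cri, "web", "nginx", "nginx:1.25")
	fmt.Println(id, err, cri.calls)
}
```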

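5.3.2. Evolution of how k8s calls the runtime

The call path changed from dockershim calling docker, which then called containerd, to cri-containerd calling containerd directly, which removes one layer of docker calls.

For details, see: https://github.com/containerd/cri/blob/release/1.4/docs/proposal.md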

5.4. Dockershim

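In older versions of k8s, docker did not implement the CRI interface, so Dockershim was added to let k8s call docker. (A shim is an adapter plugin that bridges calls to a third-party component's API; here, k8s uses Dockershim to adapt its calls to the docker interface.)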
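
The shim idea is the classic adapter pattern. The sketch below is an analogy only: containerRuntime, legacyEngine, and shim are hypothetical types, and none of the method names are taken from the real Dockershim or docker APIs.

```go
package main

import "fmt"

// containerRuntime is the hypothetical interface the caller expects
// (the kubelet, in this analogy).
type containerRuntime interface {
	StartContainer(id string) error
	StopContainer(id string, timeoutSeconds int64) error
}

// legacyEngine stands in for a third-party component with its own API shape.
type legacyEngine struct{ log []string }

func (e *legacyEngine) ContainerStart(name string) error {
	e.log = append(e.log, "start "+name)
	return nil
}

func (e *legacyEngine) ContainerStop(name string, timeout string) error {
	e.log = append(e.log, "stop "+name+" "+timeout)
	return nil
}

// shim translates containerRuntime calls into legacyEngine calls.
type shim struct{ engine *legacyEngine }

func (s *shim) StartContainer(id string) error {
	return s.engine.ContainerStart(id)
}

func (s *shim) StopContainer(id string, timeoutSeconds int64) error {
	return s.engine.ContainerStop(id, fmt.Sprintf("%ds", timeoutSeconds))
}

// restart only knows the containerRuntime interface, never legacyEngine.
func restart(rt containerRuntime, id string) error {
	if err := rt.StopContainer(id, 10); err != nil {
		return err
	}
	return rt.StartContainer(id)
}

func main() {
	engine := &legacyEngine{}
	fmt.Println(restart(&shim{engine: engine}, "c1"), engine.log)
}
```

The caller stays unchanged when the engine behind the shim is replaced, which is why removing the shim layer later only required a runtime that implements the interface directly.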

5.5. CRI-O

cri-o is similar to containerd: it manages containers and can be used in place of containerd.


12.2 - Containerd

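12.2.1 - Installing containerd

1. Installing containerd on Ubuntu

The following uses Ubuntu as an example.

Note: installing containerd follows almost the same steps as installing docker; the difference is that docker-ce does not need to be installed.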

  • containerd: apt-get install -y containerd.io
  • docker: apt-get install docker-ce docker-ce-cli containerd.io

1. Remove old versions

 sudo apt-get remove docker docker-engine docker.io containerd runc

To also delete image and container data, run the following commands:

 sudo rm -rf /var/lib/docker
 sudo rm -rf /var/lib/containerd

2. Prepare the package environment

1) Update apt and allow it to use https.

 sudo apt-get update
 sudo apt-get install \
    ca-certificates \
    curl \
    gnupg \
    lsb-release

2) Add the official docker GPG key.

sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg

3) Set up the package repository.

echo \
  "deb [arch=$(dpkg --print-architecture) signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu \
  $(lsb_release -cs) stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

3. Install containerd

# Install containerd
sudo apt-get update
sudo apt-get install -y containerd.io

# To install docker instead, run:
sudo apt-get install docker-ce docker-ce-cli containerd.io

# Enable the service at boot and check its status
sudo systemctl enable containerd
systemctl status containerd

Install a specific version

# List the available versions
apt-cache madison containerd.io

# sudo apt-get install containerd.io=<VERSION>

4. Modify the configuration

On Linux, the default CRI socket for containerd is /run/containerd/containerd.sock.

1) Generate the default configuration

containerd config default > /etc/containerd/config.toml

2) Set the CgroupDriver to systemd

k8s officially recommends the systemd CgroupDriver.

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  ...
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true

3) Restart containerd

systemctl restart containerd

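2. Offline binary installation of containerd

Download the containerd, runc, cni-plugins, and nerdctl binaries locally, upload them to the target server, extract the files to the corresponding directories, modify the containerd configuration file, and start containerd.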

#!/bin/bash
set -e

ContainerdVersion=$1
ContainerdVersion=${ContainerdVersion:-1.6.6}

RuncVersion=$2
RuncVersion=${RuncVersion:-1.1.3}

CniVersion=$3
CniVersion=${CniVersion:-1.1.1}

NerdctlVersion=$4
NerdctlVersion=${NerdctlVersion:-0.21.0}

CrictlVersion=$5
CrictlVersion=${CrictlVersion:-1.24.2}

echo "--------------install containerd--------------"
wget https://github.com/containerd/containerd/releases/download/v${ContainerdVersion}/containerd-${ContainerdVersion}-linux-amd64.tar.gz
tar Cxzvf /usr/local containerd-${ContainerdVersion}-linux-amd64.tar.gz

echo "--------------install containerd service--------------"
wget https://raw.githubusercontent.com/containerd/containerd/681aaf68b7dcbe08a51c3372cbb8f813fb4466e0/containerd.service
mv containerd.service /lib/systemd/system/

mkdir -p /etc/containerd/
containerd config default > /etc/containerd/config.toml

echo "--------------install runc--------------"
wget https://github.com/opencontainers/runc/releases/download/v${RuncVersion}/runc.amd64
chmod +x runc.amd64
mv runc.amd64 /usr/local/bin/runc

echo "--------------install cni plugins--------------"
wget https://github.com/containernetworking/plugins/releases/download/v${CniVersion}/cni-plugins-linux-amd64-v${CniVersion}.tgz
rm -fr /opt/cni/bin
mkdir -p /opt/cni/bin
tar Cxzvf /opt/cni/bin cni-plugins-linux-amd64-v${CniVersion}.tgz

echo "--------------install nerdctl--------------"
wget https://github.com/containerd/nerdctl/releases/download/v${NerdctlVersion}/nerdctl-${NerdctlVersion}-linux-amd64.tar.gz
tar Cxzvf /usr/local/bin nerdctl-${NerdctlVersion}-linux-amd64.tar.gz

echo "--------------install crictl--------------"
wget https://github.com/kubernetes-sigs/cri-tools/releases/download/v${CrictlVersion}/crictl-v${CrictlVersion}-linux-amd64.tar.gz
tar Cxzvf /usr/local/bin crictl-v${CrictlVersion}-linux-amd64.tar.gz

# Start the containerd service
systemctl daemon-reload
systemctl restart containerd


12.2.2 -

crictl

#!/bin/bash
set -e

# The crictl version is the first argument; the default matches the install script in 12.2.1.
CrictlVersion=$1
CrictlVersion=${CrictlVersion:-1.24.2}

echo "--------------install crictl--------------"
wget https://github.com/kubernetes-sigs/cri-tools/releases/download/v${CrictlVersion}/crictl-v${CrictlVersion}-linux-amd64.tar.gz
tar Cxzvf /usr/local/bin crictl-v${CrictlVersion}-linux-amd64.tar.gz

Set up the configuration file

cat > /etc/crictl.yaml << EOF
runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 2
debug: false
pull-image-on-create: false
EOF

12.3 - Docker

12.3.1 -

Docker Study Notes

See: Docker Study Notes

12.3.2 -

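1. Creating the Docker Client

Docker uses a client/server architecture. The docker binary creates a Docker Client, which sends the request type and parameters to the Docker Server; the Docker Server then executes the actual command. The Docker Client flow is shown below:

Note: the code analyzed here is from Docker 1.2.0.

1.1. Parsing docker command flags

Both the Docker Server and the Docker Client are created and started by the docker executable.

  • Starting the Docker Server: docker -d or docker --daemon=true
  • Starting the Docker Client: docker --daemon=false ps, and so on

docker arguments fall into two categories:

  • Command-line flags: --daemon=true, -d
  • Request arguments: ps, images, pull, push, and so on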

/docker/docker.go

func main() {
       if reexec.Init() {
           return
       }
       flag.Parse()
       // FIXME: validate daemon flags here
       ......
   }

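reexec.Init() coordinates the execdriver with dockerinit when a container is created. If it returns true, the program exits immediately; otherwise execution continues. After the reexec.Init() check, flag.Parse() is called to parse the flags on the command line.

/docker/flags.go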

var (
      flVersion     = flag.Bool([]string{"v", "-version"}, false, "Print version information and quit")
      flDaemon      = flag.Bool([]string{"d", "-daemon"}, false, "Enable daemon mode")
      flDebug       = flag.Bool([]string{"D", "-debug"}, false, "Enable debug mode")
      flSocketGroup = flag.String([]string{"G", "-group"}, "docker", "Group to assign the unix socket specified by -H when running in daemon mode\nuse '' (the empty string) to disable setting of a group")
      flEnableCors  = flag.Bool([]string{"#api-enable-cors", "-api-enable-cors"}, false, "Enable CORS headers in the remote API")
      flTls         = flag.Bool([]string{"-tls"}, false, "Use TLS; implied by tls-verify flags")
      flTlsVerify   = flag.Bool([]string{"-tlsverify"}, false, "Use TLS and verify the remote (daemon: verify client, client: verify daemon)")
 
      // these are initialized in init() below since their default values depend on dockerCertPath which isn't fully initialized until init() runs
      flCa    *string
      flCert  *string
      flKey   *string
      flHosts []string
  )
 
  func init() {
      flCa = flag.String([]string{"-tlscacert"}, filepath.Join(dockerCertPath, defaultCaFile), "Trust only remotes providing a certificate signed by the CA given here")
      flCert = flag.String([]string{"-tlscert"}, filepath.Join(dockerCertPath, defaultCertFile), "Path to TLS certificate file")
      flKey = flag.String([]string{"-tlskey"}, filepath.Join(dockerCertPath, defaultKeyFile), "Path to TLS key file")
      opts.HostListVar(&flHosts, []string{"H", "-host"}, "The socket(s) to bind to in daemon mode\nspecified using one or more tcp://host:port, unix:///path/to/socket, fd://* or fd://socketfd.")
  }

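flags.go defines the flags and initializes them in init.

The init function in Go

  1. It performs package initialization before the program runs, such as initializing variables.
  2. A package or source file may contain multiple init functions.
  3. init functions cannot be called explicitly; they run automatically before the main function is called.
  4. init functions in different packages run in the order in which the packages are imported.

Flag parsing stops at the first non-flag argument. For example, with docker --daemon=false --version=false ps:

  • The flags --daemon=false and --version=false are parsed.
  • On reaching the first non-flag argument, ps, that argument and everything after it are stored in flag.Args() for the subsequent request.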
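
The stop-at-first-non-flag behaviour can be reproduced with Go's standard flag package. Note the assumption: Docker 1.2.0 uses its own flag package with multi-name flags, so this sketch shows the same parsing rule rather than Docker's code, and parseGlobal is a hypothetical helper.

```go
package main

import (
	"flag"
	"fmt"
)

// parseGlobal parses the global flags and returns the daemon flag together
// with the arguments left over after the first non-flag argument.
func parseGlobal(args []string) (bool, []string, error) {
	fs := flag.NewFlagSet("docker", flag.ContinueOnError)
	daemon := fs.Bool("daemon", false, "Enable daemon mode")
	fs.Bool("version", false, "Print version information and quit")
	if err := fs.Parse(args); err != nil {
		return false, nil, err
	}
	return *daemon, fs.Args(), nil
}

func main() {
	daemon, rest, err := parseGlobal([]string{"--daemon=false", "--version=false", "ps", "-a"})
	fmt.Println(daemon, rest, err)
}
```

Here "ps" ends flag parsing, so "-a" is not treated as a global flag; it stays in the remaining arguments and is left for the subcommand to interpret.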

1.2. Processing flags and collecting the Docker Client configuration

The flags processed are flVersion, flDebug, flDaemon, flTlsVerify, and flTls.

/docker/docker.go

func main() {
    ......
    if len(flHosts) == 0 {
        defaultHost := os.Getenv("DOCKER_HOST")
        if defaultHost == "" || *flDaemon {
            // If we do not have a host, default to unix socket
            defaultHost = fmt.Sprintf("unix://%s", api.DEFAULTUNIXSOCKET)
        }
        if _, err := api.ValidateHost(defaultHost); err != nil {
            log.Fatal(err)
        }
        flHosts = append(flHosts, defaultHost)
    }
    ......
}

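flHosts gives the Docker Client the host to connect to, and gives the Docker Server the address to listen on. When flHosts is empty, the DOCKER_HOST environment variable is used by default. If that is also empty, or flDaemon is true, the host is set to the unix socket unix:///var/run/docker.sock, taken from the DEFAULTUNIXSOCKET constant in /api/common.go.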
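
The fallback rule can be written as a small pure function. This is a sketch: resolveHosts is a hypothetical helper, and the environment value is passed in as a parameter (rather than read with os.Getenv) so the rule is easy to test.

```go
package main

import "fmt"

// defaultUnixSocket matches the socket path quoted above.
const defaultUnixSocket = "/var/run/docker.sock"

// resolveHosts applies the fallback rule: explicit -H values win; otherwise
// DOCKER_HOST is used, unless it is empty or the process is a daemon, in
// which case the default unix socket is used.
func resolveHosts(flHosts []string, envHost string, daemon bool) []string {
	if len(flHosts) != 0 {
		return flHosts
	}
	host := envHost
	if host == "" || daemon {
		host = "unix://" + defaultUnixSocket
	}
	return []string{host}
}

func main() {
	fmt.Println(resolveHosts(nil, "", false))
	fmt.Println(resolveHosts(nil, "tcp://10.0.0.5:2375", false))
	fmt.Println(resolveHosts(nil, "tcp://10.0.0.5:2375", true))
	fmt.Println(resolveHosts([]string{"tcp://127.0.0.1:2375"}, "", false))
}
```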

/docker/docker.go

func main() {  
    ...
    if *flDaemon {
        mainDaemon()
        return
    }
    ...
}

If flDaemon is true, the Docker Daemon is being started, and func mainDaemon() in /docker/daemon.go is called.

/docker/docker.go

if len(flHosts) > 1 {
    log.Fatal("Please specify only one -H")
}
protoAddrParts := strings.SplitN(flHosts[0], "://", 2)

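protoAddrParts holds the protocol and address that the Docker Client uses to communicate with the Docker Server, obtained by splitting flHosts[0] with strings.SplitN. The value of flHosts[0] can be, for example, tcp://0.0.0.0:2375 or unix:///var/run/docker.sock.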
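
A short example shows what strings.SplitN yields for the two kinds of host value. splitProtoAddr is a hypothetical helper, and the length check is an addition of this sketch that the quoted code does not perform.

```go
package main

import (
	"fmt"
	"strings"
)

// splitProtoAddr splits "proto://addr" into its two parts. Limiting the
// split to 2 keeps any later "://" inside the address untouched.
func splitProtoAddr(host string) (string, string, error) {
	parts := strings.SplitN(host, "://", 2)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("invalid host %q: missing protocol", host)
	}
	return parts[0], parts[1], nil
}

func main() {
	for _, h := range []string{"tcp://0.0.0.0:2375", "unix:///var/run/docker.sock"} {
		proto, addr, err := splitProtoAddr(h)
		fmt.Println(proto, addr, err)
	}
}
```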

/docker/docker.go

var (
    cli       *client.DockerCli
    tlsConfig tls.Config
)
tlsConfig.InsecureSkipVerify = true

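The tlsConfig object is created so that cli follows the Transport Layer Security (TLS) protocol when transferring data. If flTlsVerify is true, the Docker Client and the Docker Server verify the security of the connection together. If either flTls or flTlsVerify is true, the client certificate must be loaded and sent.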

/docker/flags.go

flTls         = flag.Bool([]string{"-tls"}, false, "Use TLS; implied by tls-verify flags")
flTlsVerify   = flag.Bool([]string{"-tlsverify"}, false, "Use TLS and verify the remote (daemon: verify client, client: verify daemon)")

1.3. How the Docker Client is created

/docker/docker.go

if *flTls || *flTlsVerify {
    cli = client.NewDockerCli(os.Stdin, os.Stdout, os.Stderr, protoAddrParts[0], protoAddrParts[1], &tlsConfig)
} else {
    cli = client.NewDockerCli(os.Stdin, os.Stdout, os.Stderr, protoAddrParts[0], protoAddrParts[1], nil)
}

With the configuration parameters in place, NewDockerCli in /api/client/cli.go creates the Docker Client instance cli.

/api/client/cli.go

type DockerCli struct {
    proto      string
    addr       string
    configFile *registry.ConfigFile
    in         io.ReadCloser
    out        io.Writer
    err        io.Writer
    isTerminal bool
    terminalFd uintptr
    tlsConfig  *tls.Config
    scheme     string
}
 
func NewDockerCli(in io.ReadCloser, out, err io.Writer, proto, addr string, tlsConfig *tls.Config) *DockerCli {
    var (
        isTerminal = false
        terminalFd uintptr
        scheme     = "http"
    )
 
    if tlsConfig != nil {
        scheme = "https"
    }
 
    if in != nil {
        if file, ok := out.(*os.File); ok {
            terminalFd = file.Fd()
            isTerminal = term.IsTerminal(terminalFd)
        }
    }
 
    if err == nil {
        err = out
    }
    return &DockerCli{
        proto:      proto,
        addr:       addr,
        in:         in,
        out:        out,
        err:        err,
        isTerminal: isTerminal,
        terminalFd: terminalFd,
        tlsConfig:  tlsConfig,
        scheme:     scheme,
    }
}

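2. Docker command execution

2.1. Docker Client parses the request command

After the Docker Client is created, the request arguments in the docker command (for example ps, placed in flag.Args() after flag parsing) are analyzed together with the request type, converted into a request that the Docker Server understands, and sent to the Docker Server.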

/docker/docker.go

if err := cli.Cmd(flag.Args()...); err != nil {
    if sterr, ok := err.(*utils.StatusError); ok {
        if sterr.Status != "" {
            log.Println(sterr.Status)
        }
        os.Exit(sterr.StatusCode)
    }
    log.Fatal(err)
}

The request arguments in flag.Args() are passed to the cli.Cmd function, which is defined in /api/client/cli.go.

/api/client/cli.go

    // Cmd executes the specified command
    func (cli *DockerCli) Cmd(args ...string) error {
        if len(args) > 0 {
            method, exists := cli.getMethod(args[0])
            if !exists {
                fmt.Println("Error: Command not found:", args[0])
                return cli.CmdHelp(args[1:]...)
            }
            return method(args[1:]...)
        }
        return cli.CmdHelp(args...)
    }
 
method, exists := cli.getMethod(args[0]) looks up the method for the request. For example, with docker pull ImageName, args[0] is pull.
 
    func (cli *DockerCli) getMethod(name string) (func(...string) error, bool) {
        if len(name) == 0 {
            return nil, false
        }
        methodName := "Cmd" + strings.ToUpper(name[:1]) + strings.ToLower(name[1:])
        method := reflect.ValueOf(cli).MethodByName(methodName)
        if !method.IsValid() {
            return nil, false
        }
        return method.Interface().(func(...string) error), true
    }

In getMethod, the resulting method name is "CmdPull". Finally method(args[1:]...) is executed, which is CmdPull(args[1:]...).
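
The name-to-method lookup can be tried in isolation. The sketch below keeps the same reflection technique as getMethod but uses a hypothetical cli type with two toy commands; the error handling in Cmd is simplified compared with the original, which prints help instead.

```go
package main

import (
	"fmt"
	"reflect"
	"strings"
)

// cli is a hypothetical stand-in for DockerCli with two toy commands.
type cli struct{ log []string }

func (c *cli) CmdPull(args ...string) error {
	c.log = append(c.log, "pull "+strings.Join(args, " "))
	return nil
}

func (c *cli) CmdPs(args ...string) error {
	c.log = append(c.log, "ps")
	return nil
}

// getMethod maps a command name such as "pull" to the method "CmdPull"
// using reflection, so new commands only need a new Cmd* method.
func (c *cli) getMethod(name string) (func(...string) error, bool) {
	if name == "" {
		return nil, false
	}
	methodName := "Cmd" + strings.ToUpper(name[:1]) + strings.ToLower(name[1:])
	m := reflect.ValueOf(c).MethodByName(methodName)
	if !m.IsValid() {
		return nil, false
	}
	fn, ok := m.Interface().(func(...string) error)
	return fn, ok
}

// Cmd dispatches args[0] to the matching method and passes the rest along.
func (c *cli) Cmd(args ...string) error {
	if len(args) == 0 {
		return fmt.Errorf("no command given")
	}
	method, ok := c.getMethod(args[0])
	if !ok {
		return fmt.Errorf("command not found: %s", args[0])
	}
	return method(args[1:]...)
}

func main() {
	c := &cli{}
	fmt.Println(c.Cmd("pull", "busybox"), c.log)
	fmt.Println(c.Cmd("nope"))
}
```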

2.2. Docker Client executes the request command

For docker pull ImageName, CmdPull(args[1:]...) is executed, where args[1:] is ImageName. The command code is in /api/client/commands.go.

/api/client/commands.go

func (cli *DockerCli) CmdPull(args ...string) error {
    cmd := cli.Subcmd("pull", "NAME[:TAG]", "Pull an image or a repository from the registry")
    tag := cmd.String([]string{"#t", "#-tag"}, "", "Download tagged image in a repository")
    if err := cmd.Parse(args); err != nil {
        return nil
    }
    ...
}

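The args are parsed for flags a second time. If an argument matches the tag flag, its value is assigned to tag and the remaining arguments are kept as positional arguments (counted by cmd.NArg() and read with cmd.Arg(i)); if there is none, all arguments are kept as positional arguments.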

/api/client/commands.go

var (
     v      = url.Values{}
     remote = cmd.Arg(0)
 )
 
 v.Set("fromImage", remote)
 
 if *tag == "" {
     v.Set("tag", *tag)
 }
 
 remote, _ = parsers.ParseRepositoryTag(remote)
 // Resolve the Repository name from fqn to hostname + name
 hostname, _, err := registry.ResolveRepositoryName(remote)
 if err != nil {
     return err
 }

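The repository name of the image is first parsed from remote and assigned back to remote. The updated remote is then resolved to obtain the host where the image lives, that is, the address of the Docker Registry. If none is specified, it defaults to the Docker Hub address https://index.docker.io/v1/.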

/api/client/commands.go

cli.LoadConfigFile()
 
// Resolve the Auth config relevant for this server
authConfig := cli.configFile.ResolveAuthConfig(hostname)

The authentication configuration for the resolved registry host is obtained through the cli object.

/api/client/commands.go

pull := func(authConfig registry.AuthConfig) error {
    buf, err := json.Marshal(authConfig)
    if err != nil {
        return err
    }
    registryAuthHeader := []string{
        base64.URLEncoding.EncodeToString(buf),
    }
 
    return cli.stream("POST", "/images/create?"+v.Encode(), nil, cli.out, map[string][]string{
        "X-Registry-Auth": registryAuthHeader,
    })
}

Define the pull function: cli.stream("POST", "/images/create?"+v.Encode(), ...) sends a POST request to the Docker Server. The request URL is "/images/create?"+v.Encode(), and the authentication information is sent as map[string][]string{"X-Registry-Auth": registryAuthHeader,}.
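
How the request path and the X-Registry-Auth header are assembled can be sketched without a server. Assumptions in this sketch: authConfig is a hypothetical two-field struct (the real registry.AuthConfig has more fields), buildPullRequest and decodeAuthHeader are illustrative helpers, and the tag parameter is set only when a tag is given, which differs from the condition in the quoted code.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"net/url"
)

// authConfig is a hypothetical, reduced version of registry.AuthConfig.
type authConfig struct {
	Username string `json:"username"`
	Password string `json:"password"`
}

// buildPullRequest returns the request path and headers for an image pull.
func buildPullRequest(image, tag string, auth authConfig) (string, map[string][]string, error) {
	v := url.Values{}
	v.Set("fromImage", image)
	if tag != "" {
		v.Set("tag", tag)
	}
	buf, err := json.Marshal(auth)
	if err != nil {
		return "", nil, err
	}
	headers := map[string][]string{
		// The auth config travels as URL-safe base64 of its JSON encoding.
		"X-Registry-Auth": {base64.URLEncoding.EncodeToString(buf)},
	}
	return "/images/create?" + v.Encode(), headers, nil
}

// decodeAuthHeader reverses the encoding, as a receiver of the header would.
func decodeAuthHeader(value string) (authConfig, error) {
	var auth authConfig
	buf, err := base64.URLEncoding.DecodeString(value)
	if err != nil {
		return auth, err
	}
	err = json.Unmarshal(buf, &auth)
	return auth, err
}

func main() {
	path, headers, err := buildPullRequest("busybox", "latest", authConfig{Username: "demo", Password: "secret"})
	fmt.Println(path, err)
	fmt.Println(decodeAuthHeader(headers["X-Registry-Auth"][0]))
}
```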

/api/client/commands.go

if err := pull(authConfig); err != nil {
    if strings.Contains(err.Error(), "Status 401") {
        fmt.Fprintln(cli.out, "\nPlease login prior to pull:")
        if err := cli.CmdLogin(hostname); err != nil {
            return err
        }
        authConfig := cli.configFile.ResolveAuthConfig(hostname)
        return pull(authConfig)
    }
    return err
}
 
return nil

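Calling the pull function sends the download request. The Docker Server then receives the request and carries out the actual work.

References:

  • Docker Source Code Analysis (《Docker源码分析》)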

12.3.3 -

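1. Docker Daemon architecture diagram

The Docker Daemon is the background daemon process in the Docker architecture. It can be roughly divided into three parts: Docker Server, Engine, and Job.

The Docker Daemon accepts requests from the Docker Client through the Docker Server module and processes them in the Engine, which creates and runs the Job that corresponds to the request type.

Running a Job may involve the following:

  • fetching images from the Docker Registry,
  • localizing container images through graphdriver,
  • configuring the container network environment through networkdriver,
  • running the workload inside the container through execdriver.

Note: the code analyzed here is from Docker 1.2.0.

2. Docker Daemon startup flow

The Docker Daemon is usually started with one of the following commands: docker --daemon=true, docker -d, or docker -d=true. The main() function of docker then parses the corresponding flags and completes the startup of the Docker Daemon.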

/docker/docker.go

func main() {
    ...
    if *flDaemon {
        mainDaemon()
        return
    }
    ...
}

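3. Implementation of mainDaemon

At a high level, mainDaemon() creates a daemon process and keeps it running normally.

Functionally, mainDaemon() does two things:

  • First, it creates the Docker runtime environment.
  • Second, it serves the Docker Client by receiving and handling its requests.

3.1. Configuration initialization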

/docker/daemon.go

var (
    daemonCfg = &daemon.Config{}
)
func init() {
    daemonCfg.InstallFlags()
}

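Before mainDaemon() runs, the config information the Docker Daemon needs has already been initialized.

daemonCfg is declared as a variable of type Config from the daemon package. The Config object defines the configuration the Docker Daemon needs. When the Docker Daemon starts, daemonCfg is passed to it and used.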

/daemon/config.go

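type Config struct {
    Pidfile                     string              // Path of the PID file for the Docker Daemon process
    Root                        string              // Root path used by the Docker runtime
    AutoRestart                 bool                // Deprecated in favor of restart policies on docker run
    Dns                         []string            // DNS server addresses used by Docker
    DnsSearch                   []string            // DNS search domains used by Docker
    Mirrors                     []string            // Preferred Docker Registry mirrors
    EnableIptables              bool                // Enable Docker's iptables rules
    EnableIpForward             bool                // Enable net.ipv4.ip_forward
    EnableIpMasq                bool                // Enable IP masquerading
    DefaultIp                   net.IP              // Default IP used when binding container ports
    BridgeIface                 string              // Attach container networking to an existing bridge
    BridgeIP                    string              // IP address of the bridge to create
    FixedCIDR                   string              // IPv4 subnet for fixed IPs; must be contained in the bridge subnet
    InterContainerCommunication bool                // Whether containers on the same host may communicate
    GraphDriver                 string              // Storage driver used by the Docker runtime
    GraphOptions                []string            // Storage driver options
    ExecDriver                  string              // Exec driver used by the Docker runtime
    Mtu                         int                 // MTU of the container network
    DisableNetwork              bool                // Defined, but not initialized afterwards
    EnableSelinuxSupport        bool                // Enable SELinux support
    Context                     map[string][]string // Defined, but not initialized afterwards
}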

The init() function assigns the attributes of daemonCfg; the concrete implementation is daemonCfg.InstallFlags().

/daemon/config.go

// InstallFlags adds command-line options to the top-level flag parser for
// the current process.
// Subsequent calls to `flag.Parse` will populate config with values parsed
// from the command-line.
func (config *Config) InstallFlags() {
    flag.StringVar(&config.Pidfile, []string{"p", "-pidfile"}, "/var/run/docker.pid", "Path to use for daemon PID file")
    flag.StringVar(&config.Root, []string{"g", "-graph"}, "/var/lib/docker", "Path to use as the root of the Docker runtime")
    flag.BoolVar(&config.AutoRestart, []string{"#r", "#-restart"}, true, "--restart on the daemon has been deprecated infavor of --restart policies on docker run")
    flag.BoolVar(&config.EnableIptables, []string{"#iptables", "-iptables"}, true, "Enable Docker's addition of iptables rules")
    flag.BoolVar(&config.EnableIpForward, []string{"#ip-forward", "-ip-forward"}, true, "Enable net.ipv4.ip_forward")
    flag.StringVar(&config.BridgeIP, []string{"#bip", "-bip"}, "", "Use this CIDR notation address for the network bridge's IP, not compatible with -b")
    flag.StringVar(&config.BridgeIface, []string{"b", "-bridge"}, "", "Attach containers to a pre-existing network bridge\nuse 'none' to disable container networking")
    flag.BoolVar(&config.InterContainerCommunication, []string{"#icc", "-icc"}, true, "Enable inter-container communication")
    flag.StringVar(&config.GraphDriver, []string{"s", "-storage-driver"}, "", "Force the Docker runtime to use a specific storage driver")
    flag.StringVar(&config.ExecDriver, []string{"e", "-exec-driver"}, "native", "Force the Docker runtime to use a specific exec driver")
    flag.BoolVar(&config.EnableSelinuxSupport, []string{"-selinux-enabled"}, false, "Enable selinux support. SELinux does not presently support the BTRFS storage driver")
    flag.IntVar(&config.Mtu, []string{"#mtu", "-mtu"}, 0, "Set the containers network MTU\nif no value is provided: default to the default route MTU or 1500 if no default route is available")
    opts.IPVar(&config.DefaultIp, []string{"#ip", "-ip"}, "0.0.0.0", "Default IP address to use when binding container ports")
    opts.ListVar(&config.GraphOptions, []string{"-storage-opt"}, "Set storage driver options")
    // FIXME: why the inconsistency between "hosts" and "sockets"?
    opts.IPListVar(&config.Dns, []string{"#dns", "-dns"}, "Force Docker to use specific DNS servers")
    opts.DnsSearchListVar(&config.DnsSearch, []string{"-dns-search"}, "Force Docker to use specific DNS search domains")
}

InstallFlags() mainly defines flags of particular types and binds each flag's value to a specific attribute of the config variable, for example:

flag.StringVar(&config.Pidfile, []string{"p", "-pidfile"}, "/var/run/docker.pid", "Path to use for daemon PID file")

This statement means:

  • it defines a flag of type String;
  • the flag is named "p" or "-pidfile";
  • the flag's default value is "/var/run/docker.pid", and the value is bound to the variable config.Pidfile;
  • the flag's description is "Path to use for daemon PID file".
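
Binding a flag to a struct field works the same way with Go's standard flag package, which this sketch uses as a stand-in: the standard StringVar takes a single name, whereas the flag package in Docker 1.2.0 accepts a list of names. The config type and parseConfig helper are hypothetical.

```go
package main

import (
	"flag"
	"fmt"
)

// config is a hypothetical two-field subset of the daemon Config.
type config struct {
	Pidfile string
	Root    string
}

// installFlags binds each flag to a field; the default is written into the
// field immediately and overwritten later if the flag is set during Parse.
func (c *config) installFlags(fs *flag.FlagSet) {
	fs.StringVar(&c.Pidfile, "pidfile", "/var/run/docker.pid", "Path to use for daemon PID file")
	fs.StringVar(&c.Root, "graph", "/var/lib/docker", "Path to use as the root of the Docker runtime")
}

// parseConfig installs the flags and parses the given arguments.
func parseConfig(args []string) (*config, error) {
	cfg := &config{}
	fs := flag.NewFlagSet("docker", flag.ContinueOnError)
	cfg.installFlags(fs)
	if err := fs.Parse(args); err != nil {
		return nil, err
	}
	return cfg, nil
}

func main() {
	cfg, err := parseConfig([]string{"--pidfile=/tmp/docker.pid"})
	fmt.Println(cfg, err)
}
```

Because the flag holds a pointer to the field, no copying step is needed after parsing: the config struct already contains either the default or the value from the command line.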

3.2. Flag argument check

/docker/daemon.go

if flag.NArg() != 0 {
    flag.Usage()
    return
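}

  • If the number of arguments is not 0, extra arguments were passed when starting the Docker Daemon; the usage message is printed and the program exits.
  • If it is 0, the Docker Daemon start command is correct and startup continues.

3.3. Creating the engine object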

/docker/daemon.go

eng := engine.New()

Engine is the execution engine of the Docker architecture and the core module of a running Docker. Engine acts as the store for Docker containers and manages them in the form of jobs.

/engine/engine.go

type Engine struct {
    handlers   map[string]Handler
    catchall   Handler
    hack       Hack // data for temporary hackery (see hack.go)
    id         string
    Stdout     io.Writer
    Stderr     io.Writer
    Stdin      io.Reader
    Logging    bool
    tasks      sync.WaitGroup
    l          sync.RWMutex // lock for shutdown
    shutdown   bool
    onShutdown []func() // shutdown handlers
}

The most important field of the Engine struct is handlers. It is a map whose keys are strings and whose values are of type Handler. Handler is a function type that takes a *Job and returns a Status.

/engine/engine.go

type Handler func(*Job) Status

Implementation of New():

/engine/engine.go

// New initializes a new engine.
func New() *Engine {
    eng := &Engine{
        handlers: make(map[string]Handler),
        id:       utils.RandomString(),
        Stdout:   os.Stdout,
        Stderr:   os.Stderr,
        Stdin:    os.Stdin,
        Logging:  true,
    }
    eng.Register("commands", func(job *Job) Status {
        for _, name := range eng.commands() {
            job.Printf("%s\n", name)
        }
        return StatusOK
    })
    // Copy existing global handlers
    for k, v := range globalHandlers {
        eng.handlers[k] = v
    }
    return eng
}
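  1. Create an Engine struct instance, eng.
  2. Register a Handler named commands on eng. The Handler is an anonymous function, func(job *Job) Status { }, which uses the job to print the names of all registered commands and returns StatusOK.
  3. Copy every Handler in the predefined variable globalHandlers into eng's handlers field, then return eng.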
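
The registry idea behind these steps can be sketched in a few lines. This is a simplified model, not the Docker source: Job here only collects output lines in a slice, and the duplicate-name check in Register is an assumption made for the example.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

type Status int

const StatusOK Status = 0

// Job is a reduced stand-in: Out collects what a handler would print.
type Job struct {
	Name string
	Out  []string
}

// Handler has the same shape as in the engine: job in, status out.
type Handler func(job *Job) Status

type Engine struct {
	handlers map[string]Handler
}

// New creates an engine and registers the built-in "commands" handler.
func New() *Engine {
	eng := &Engine{handlers: make(map[string]Handler)}
	eng.Register("commands", func(job *Job) Status {
		job.Out = append(job.Out, eng.commands()...)
		return StatusOK
	})
	return eng
}

// Register stores h under name. It does not call h.
func (eng *Engine) Register(name string, h Handler) error {
	if _, exists := eng.handlers[name]; exists {
		return errors.New("handler already registered: " + name)
	}
	eng.handlers[name] = h
	return nil
}

// commands returns the registered names in sorted order.
func (eng *Engine) commands() []string {
	names := make([]string, 0, len(eng.handlers))
	for name := range eng.handlers {
		names = append(names, name)
	}
	sort.Strings(names)
	return names
}

func main() {
	eng := New()
	eng.Register("version", func(job *Job) Status { return StatusOK })
	job := &Job{Name: "commands"}
	eng.handlers["commands"](job)
	fmt.Println(job.Out)
}
```

Registration only stores the function value in the map; nothing runs until a job looks the name up and calls the handler.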

3.4. Setting up signal trapping for the engine

/docker/daemon.go

signal.Trap(eng.Shutdown)

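While Docker Daemon runs, Trap installs handling for specific signals: SIGINT, SIGTERM and SIGQUIT. When the program catches SIGINT or SIGTERM, it runs the cleanup routine and then makes sure the Docker Daemon process exits.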

/pkg/signal/trap.go

//Trap sets up a simplified signal "trap", appropriate for common
// behavior expected from a vanilla unix command-line tool in general
// (and the Docker engine in particular).
//
// * If SIGINT or SIGTERM are received, `cleanup` is called, then the process is terminated.
// * If SIGINT or SIGTERM are repeated 3 times before cleanup is complete, then cleanup is
// skipped and the process terminated directly.
// * If "DEBUG" is set in the environment, SIGQUIT causes an exit without cleanup.
//
func Trap(cleanup func()) {
    c := make(chan os.Signal, 1)
    signals := []os.Signal{os.Interrupt, syscall.SIGTERM}
    if os.Getenv("DEBUG") == "" {
        signals = append(signals, syscall.SIGQUIT)
    }
    gosignal.Notify(c, signals...)
    go func() {
        interruptCount := uint32(0)
        for sig := range c {
            go func(sig os.Signal) {
                log.Printf("Received signal '%v', starting shutdown of docker...\n", sig)
                switch sig {
                case os.Interrupt, syscall.SIGTERM:
                    // If the user really wants to interrupt, let him do so.
                    if atomic.LoadUint32(&interruptCount) < 3 {
                        atomic.AddUint32(&interruptCount, 1)
                        // Initiate the cleanup only once
                        if atomic.LoadUint32(&interruptCount) == 1 {
                            // Call cleanup handler
                            cleanup()
                            os.Exit(0)
                        } else {
                            return
                        }
                    } else {
                        log.Printf("Force shutdown of docker, interrupting cleanup\n")
                    }
                case syscall.SIGQUIT:
                }
                os.Exit(128 + int(sig.(syscall.Signal)))
            }(sig)
        }
    }()
} 
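  • Create a channel c that receives signal notifications;
  • Define the signals slice with the initial values os.Interrupt and syscall.SIGTERM; if the DEBUG environment variable is empty, append syscall.SIGQUIT;
  • gosignal.Notify(c, signals...) relays incoming signals to c. Only the signals listed in signals are delivered to c; other signals are not;
  • Start a goroutine that handles each signal. When the signal is os.Interrupt or syscall.SIGTERM, it calls the function passed to Trap: the parameter is cleanup, and the argument here is eng.Shutdown.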
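
A stripped-down version of this pattern might look like the sketch below. It keeps only the core idea (subscribe to signals on a channel, run cleanup on the first one, then exit) and leaves out the repeat counter and the SIGQUIT branch. The exit function is injected, and the signal is simulated by sending on the channel, so that the flow can be exercised without killing the process; handleSignals and demo are hypothetical names.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

// handleSignals waits for one signal on c, runs cleanup, then calls exit(0).
// In a real daemon exit would be os.Exit.
func handleSignals(c <-chan os.Signal, cleanup func(), exit func(code int)) {
	sig, ok := <-c
	if !ok {
		return
	}
	fmt.Printf("received %v, shutting down\n", sig)
	cleanup()
	exit(0)
}

// demo wires the channel up the way Trap does and simulates one SIGINT.
func demo() (cleaned bool, code int) {
	c := make(chan os.Signal, 1)
	signal.Notify(c, os.Interrupt, syscall.SIGTERM)
	defer signal.Stop(c)

	c <- os.Interrupt // simulated delivery instead of a real signal
	code = -1
	handleSignals(c, func() { cleaned = true }, func(n int) { code = n })
	return cleaned, code
}

func main() {
	fmt.Println(demo())
}
```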

Shutdown() is defined in ./docker/engine/engine.go. It performs the cleanup work needed before Docker Daemon stops.

/engine/engine.go

// Shutdown permanently shuts down eng as follows:
// - It refuses all new jobs, permanently.
// - It waits for all active jobs to complete (with no timeout)
// - It calls all shutdown handlers concurrently (if any)
// - It returns when all handlers complete, or after 15 seconds,
//    whichever happens first.
func (eng *Engine) Shutdown() {
    eng.l.Lock()
    if eng.shutdown {
        eng.l.Unlock()
        return
    }
    eng.shutdown = true
    eng.l.Unlock()
    // We don't need to protect the rest with a lock, to allow
    // for other calls to immediately fail with "shutdown" instead
    // of hanging for 15 seconds.
    // This requires all concurrent calls to check for shutdown, otherwise
    // it might cause a race.
    // Wait for all jobs to complete.
    // Timeout after 5 seconds.
    tasksDone := make(chan struct{})
    go func() {
        eng.tasks.Wait()
        close(tasksDone)
    }()
    select {
    case <-time.After(time.Second * 5):
    case <-tasksDone:
    }
    // Call shutdown handlers, if any.
    // Timeout after 10 seconds.
    var wg sync.WaitGroup
    for _, h := range eng.onShutdown {
        wg.Add(1)
        go func(h func()) {
            defer wg.Done()
            h()
        }(h)
    }
    done := make(chan struct{})
    go func() {
        wg.Wait()
        close(done)
    }()
    select {
    case <-time.After(time.Second * 10):
    case <-done:
    }
    return
}
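  • Docker Daemon stops accepting new Jobs;
  • Docker Daemon waits for all active Jobs to finish (the code gives up waiting after 5 seconds);
  • Docker Daemon calls all shutdown handlers concurrently;
  • Shutdown() returns when all handlers have finished or after 15 seconds in total (5 seconds for the Jobs plus 10 seconds for the handlers), whichever comes first.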
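
Both waits in Shutdown use the same idiom: run Wait in a goroutine, close a channel when it returns, and select between that channel and time.After. A minimal sketch of the idiom follows; waitTimeout, runHandlers and demo are hypothetical helper names, not part of the Docker source.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// waitTimeout waits for wg but gives up after d.
// It reports whether the wait finished in time.
func waitTimeout(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		return false
	}
}

// runHandlers starts every handler in its own goroutine and waits
// at most d for all of them to return.
func runHandlers(handlers []func(), d time.Duration) bool {
	var wg sync.WaitGroup
	for _, h := range handlers {
		wg.Add(1)
		go func(h func()) {
			defer wg.Done()
			h()
		}(h)
	}
	return waitTimeout(&wg, d)
}

// demo runs one quick handler with a generous timeout and one slow
// handler with a short timeout.
func demo() (fastOK, slowOK bool) {
	fast := func() {}
	slow := func() { time.Sleep(500 * time.Millisecond) }
	fastOK = runHandlers([]func(){fast}, time.Second)
	slowOK = runHandlers([]func(){slow}, 10*time.Millisecond)
	return fastOK, slowOK
}

func main() {
	fmt.Println(demo())
}
```

Note that a timed-out wait does not stop the handlers; they keep running in the background, which is acceptable here because the process exits right after Shutdown returns.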

Because eng.Shutdown is invoked inside signal.Trap(eng.Shutdown), os.Exit(0) runs as soon as eng.Shutdown returns, and the program exits immediately.

3.5. Loading builtins

/docker/daemon.go

if err := builtins.Register(eng); err != nil {
    log.Fatal(err)
}

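Register installs several Handlers on the engine so that the matching Handler can be run when the corresponding job is executed later.

These Handlers cover:

  • network initialization;
  • the web API service;
  • event queries;
  • version reporting;
  • Docker Registry authentication and search.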

/builtins/builtins.go

func Register(eng *engine.Engine) error {
    if err := daemon(eng); err != nil {
        return err
    }
    if err := remote(eng); err != nil {
        return err
    }
    if err := events.New().Install(eng); err != nil {
        return err
    }
    if err := eng.Register("version", dockerVersion); err != nil {
        return err
    }
    return registry.NewService().Install(eng)
}

3.5.1. Registering the Handler that initializes the network driver

daemon(eng) registers a Handler on eng under the key "init_networkdriver"; its value is the function bridge.InitDriver. The code is:

/builtins/builtins.go

func daemon(eng *engine.Engine) error {
    return eng.Register("init_networkdriver", bridge.InitDriver)
}

Note that registering a Handler on eng does not run the Handler's function. bridge.InitDriver, for example, is not executed at this point; only its entry point is stored in eng's handlers field.

/daemon/networkdriver/bridge/driver.go

func InitDriver(job *engine.Job) engine.Status {
    var (
        network        *net.IPNet
        enableIPTables = job.GetenvBool("EnableIptables")
        icc            = job.GetenvBool("InterContainerCommunication")
        ipForward      = job.GetenvBool("EnableIpForward")
        bridgeIP       = job.Getenv("BridgeIP")
    )
 
    if defaultIP := job.Getenv("DefaultBindingIP"); defaultIP != "" {
        defaultBindingIP = net.ParseIP(defaultIP)
    }
 
    bridgeIface = job.Getenv("BridgeIface")
    usingDefaultBridge := false
    if bridgeIface == "" {
        usingDefaultBridge = true
        bridgeIface = DefaultNetworkBridge
    }
 
    addr, err := networkdriver.GetIfaceAddr(bridgeIface)
    if err != nil {
        // If we're not using the default bridge, fail without trying to create it
        if !usingDefaultBridge {
            job.Logf("bridge not found: %s", bridgeIface)
            return job.Error(err)
        }
        // If the iface is not found, try to create it
        job.Logf("creating new bridge for %s", bridgeIface)
        if err := createBridge(bridgeIP); err != nil {
            return job.Error(err)
        }
 
        job.Logf("getting iface addr")
        addr, err = networkdriver.GetIfaceAddr(bridgeIface)
        if err != nil {
            return job.Error(err)
        }
        network = addr.(*net.IPNet)
    } else {
        network = addr.(*net.IPNet)
        // validate that the bridge ip matches the ip specified by BridgeIP
        if bridgeIP != "" {
            bip, _, err := net.ParseCIDR(bridgeIP)
            if err != nil {
                return job.Error(err)
            }
            if !network.IP.Equal(bip) {
                return job.Errorf("bridge ip (%s) does not match existing bridge configuration %s", network.IP, bip)
            }
        }
    }
 
    // Configure iptables for link support
    if enableIPTables {
        if err := setupIPTables(addr, icc); err != nil {
            return job.Error(err)
        }
    }
 
    if ipForward {
        // Enable IPv4 forwarding
        if err := ioutil.WriteFile("/proc/sys/net/ipv4/ip_forward", []byte{'1', '\n'}, 0644); err != nil {
            job.Logf("WARNING: unable to enable IPv4 forwarding: %s\n", err)
        }
    }
 
    // We can always try removing the iptables
    if err := iptables.RemoveExistingChain("DOCKER"); err != nil {
        return job.Error(err)
    }
 
    if enableIPTables {
        chain, err := iptables.NewChain("DOCKER", bridgeIface)
        if err != nil {
            return job.Error(err)
        }
        portmapper.SetIptablesChain(chain)
    }
 
    bridgeNetwork = network
 
    // https://github.com/docker/docker/issues/2768
    job.Eng.Hack_SetGlobalVar("httpapi.bridgeIP", bridgeNetwork.IP)
 
    for name, f := range map[string]engine.Handler{
        "allocate_interface": Allocate,
        "release_interface":  Release,
        "allocate_port":      AllocatePort,
        "link":               LinkContainers,
    } {
        if err := job.Eng.Register(name, f); err != nil {
            return job.Error(err)
        }
    }
    return engine.StatusOK
}

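bridge.InitDriver does the following:

  • obtains the address of the network device that serves Docker;
  • creates the bridge with the specified IP address when the default bridge does not exist yet;
  • configures the iptables rules for the network;
  • registers further Handlers on eng: "allocate_interface", "release_interface", "allocate_port" and "link".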
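
One detail of InitDriver that is easy to miss is the consistency check in the else branch: when the bridge already exists and BridgeIP is set, the IP parsed from BridgeIP must equal the bridge's current IP. The sketch below isolates that check using only the net package; checkBridgeIP and ifaceNet are hypothetical helpers, and the addresses are made-up examples.

```go
package main

import (
	"fmt"
	"net"
)

// checkBridgeIP returns an error if requestedCIDR is set and its IP
// differs from the IP already configured on the bridge.
func checkBridgeIP(existing *net.IPNet, requestedCIDR string) error {
	if requestedCIDR == "" {
		return nil // nothing requested, nothing to validate
	}
	bip, _, err := net.ParseCIDR(requestedCIDR)
	if err != nil {
		return err
	}
	if !existing.IP.Equal(bip) {
		return fmt.Errorf("bridge ip (%s) does not match existing bridge configuration %s", existing.IP, bip)
	}
	return nil
}

// ifaceNet builds an interface-style *net.IPNet that keeps the host
// address, unlike the masked network returned by net.ParseCIDR.
func ifaceNet(cidr string) *net.IPNet {
	ip, ipnet, err := net.ParseCIDR(cidr)
	if err != nil {
		panic(err)
	}
	ipnet.IP = ip
	return ipnet
}

func main() {
	existing := ifaceNet("172.17.42.1/16")
	fmt.Println(checkBridgeIP(existing, "172.17.42.1/16"))
	fmt.Println(checkBridgeIP(existing, "10.0.0.1/8"))
}
```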

3.5.2. Registering the API service Handlers

remote(eng) registers two Handlers on eng, "serveapi" and "acceptconnections". The code is:

/builtins/builtins.go

func remote(eng *engine.Engine) error {
    if err := eng.Register("serveapi", apiserver.ServeApi); err != nil {
        return err
    }
    return eng.Register("acceptconnections", apiserver.AcceptConnections)
}

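The two registered Handlers are named "serveapi" and "acceptconnections":

  • ServeApi loops over the configured protocols and starts a goroutine per protocol that sets up an http.Server, so that requests on each protocol get served;
  • AcceptConnections notifies the init daemon that Docker Daemon has finished starting, so that the Docker Daemon process can begin accepting requests.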

3.5.3. Registering the events Handlers

events.New().Install(eng) registers several event-related Handlers for Docker. They give users an API for reading Docker's internal events, log and subscribers_count information.

/events/events.go

type Events struct {
    mu          sync.RWMutex
    events      []*utils.JSONMessage
    subscribers []listener
}
func New() *Events {
    return &Events{
        events: make([]*utils.JSONMessage, 0, eventsLimit),
    }
}
// Install installs events public api in docker engine
func (e *Events) Install(eng *engine.Engine) error {
    // Here you should describe public interface
    jobs := map[string]engine.Handler{
        "events":            e.Get,
        "log":               e.Log,
        "subscribers_count": e.SubscribersCount,
    }
    for name, job := range jobs {
        if err := eng.Register(name, job); err != nil {
            return err
        }
    }
    return nil
}

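3.5.4. Registering the version Handler

eng.Register("version", dockerVersion) registers a Handler on eng whose key is "version" and whose value is the function dockerVersion. When it runs, dockerVersion writes the following to the standard output of the version job: the Docker version, the Docker API version, the git commit, the Go runtime version, and the operating system, architecture and kernel version.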

/builtins/builtins.go

// builtins jobs independent of any subsystem
func dockerVersion(job *engine.Job) engine.Status {
    v := &engine.Env{}
    v.SetJson("Version", dockerversion.VERSION)
    v.SetJson("ApiVersion", api.APIVERSION)
    v.Set("GitCommit", dockerversion.GITCOMMIT)
    v.Set("GoVersion", runtime.Version())
    v.Set("Os", runtime.GOOS)
    v.Set("Arch", runtime.GOARCH)
    if kernelVersion, err := kernel.GetKernelVersion(); err == nil {
        v.Set("KernelVersion", kernelVersion.String())
    }
    if _, err := v.WriteTo(job.Stdout); err != nil {
        return job.Error(err)
    }
    return engine.StatusOK
}

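3.5.5. Registering the registry Handlers

registry.NewService().Install(eng) is implemented in ./docker/registry/service.go. It adds Docker Registry capabilities to the API that eng exposes. Once the Service returned by registry.NewService() has been installed, eng can run two more jobs: "auth", which authenticates against the public registry, and "search", which searches the public registry for a given image.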

/registry/service.go

// NewService returns a new instance of Service ready to be
// installed on an engine.
func NewService() *Service {
    return &Service{}
}
// Install installs registry capabilities to eng.
func (s *Service) Install(eng *engine.Engine) error {
    eng.Register("auth", s.Auth)
    eng.Register("search", s.Search)
    return nil
}

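3.6. Loading the daemon object in a goroutine

After the builtins are loaded, execution returns to mainDaemon(), which starts a goroutine to load the daemon object and get it running. This stage consists of three steps:

  • create a daemon object d from the eng object and daemonCfg, which was initialized in the init function;
  • call the daemon object's Install function to register many Handlers on eng;
  • once Docker Daemon has finished starting, run the job named "acceptconnections", which sends "READY=1" to the init daemon so that requests can be accepted.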

/docker/daemon.go

// load the daemon in the background so we can immediately start
// the http api so that connections don't fail while the daemon
// is booting
go func() {
    d, err := daemon.NewDaemon(daemonCfg, eng)
    if err != nil {
        log.Fatal(err)
    }
    if err := d.Install(eng); err != nil {
        log.Fatal(err)
    }
    // after the daemon is done setting up we can tell the api to start
    // accepting connections
    if err := eng.Job("acceptconnections").Run(); err != nil {
        log.Fatal(err)
    }
}()

3.6.1. Creating the daemon object

/docker/daemon.go

d, err := daemon.NewDaemon(daemonCfg, eng)
if err != nil {
    log.Fatal(err)
}

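daemon.NewDaemon(daemonCfg, eng) is the core of creating the daemon object d. It initializes the basic environment of Docker Daemon: processing the config parameters, verifying system support, configuring the Docker working directory, setting up and loading the various drivers, creating the graph environment, and validating the DNS configuration. See NewDaemon for details.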

3.6.2. Registering Handlers on the engine through the daemon object

Once the daemon object has been created, the goroutine calls d.Install(eng):

/daemon/daemon.go

type Daemon struct {
    repository     string
    sysInitPath    string
    containers     *contStore
    graph          *graph.Graph
    repositories   *graph.TagStore
    idIndex        *truncindex.TruncIndex
    sysInfo        *sysinfo.SysInfo
    volumes        *graph.Graph
    eng            *engine.Engine
    config         *Config
    containerGraph *graphdb.Database
    driver         graphdriver.Driver
    execDriver     execdriver.Driver
}
// Install installs daemon capabilities to eng.
func (daemon *Daemon) Install(eng *engine.Engine) error {
    // FIXME: rename "delete" to "rm" for consistency with the CLI command
    // FIXME: rename ContainerDestroy to ContainerRm for consistency with the CLI command
    // FIXME: remove ImageDelete's dependency on Daemon, then move to graph/
    for name, method := range map[string]engine.Handler{
        "attach":            daemon.ContainerAttach,
        "build":             daemon.CmdBuild,
        "commit":            daemon.ContainerCommit,
        "container_changes": daemon.ContainerChanges,
        "container_copy":    daemon.ContainerCopy,
        "container_inspect": daemon.ContainerInspect,
        "containers":        daemon.Containers,
        "create":            daemon.ContainerCreate,
        "delete":            daemon.ContainerDestroy,
        "export":            daemon.ContainerExport,
        "info":              daemon.CmdInfo,
        "kill":              daemon.ContainerKill,
        "logs":              daemon.ContainerLogs,
        "pause":             daemon.ContainerPause,
        "resize":            daemon.ContainerResize,
        "restart":           daemon.ContainerRestart,
        "start":             daemon.ContainerStart,
        "stop":              daemon.ContainerStop,
        "top":               daemon.ContainerTop,
        "unpause":           daemon.ContainerUnpause,
        "wait":              daemon.ContainerWait,
        "image_delete":      daemon.ImageDelete, // FIXME: see above
    } {
        if err := eng.Register(name, method); err != nil {
            return err
        }
    }
    if err := daemon.Repositories().Install(eng); err != nil {
        return err
    }
    // FIXME: this hack is necessary for legacy integration tests to access
    // the daemon object.
    eng.Hack_SetGlobalVar("httpapi.daemon", daemon)
    return nil
}

The code above does three things:

  • registers many Handlers on eng;
  • daemon.Repositories().Install(eng) registers several image-related Handlers on eng; this Install is implemented in ./docker/graph/service.go;
  • eng.Hack_SetGlobalVar("httpapi.daemon", daemon) adds an entry to eng's map-typed hack field, with key "httpapi.daemon" and value daemon.

3.6.3. Running the acceptconnections job

/docker/daemon.go

if err := eng.Job("acceptconnections").Run(); err != nil {
    log.Fatal(err)
}

The last thing the goroutine does is run the job named "acceptconnections", which notifies the init daemon that Docker Daemon can start accepting requests.

eng.Job("acceptconnections") first returns a Job; eng.Job("acceptconnections").Run() then calls that Job's Run function.

/engine/engine.go

// Job creates a new job which can later be executed.
// This function mimics `Command` from the standard os/exec package.
func (eng *Engine) Job(name string, args ...string) *Job {
    job := &Job{
        Eng:    eng,
        Name:   name,
        Args:   args,
        Stdin:  NewInput(),
        Stdout: NewOutput(),
        Stderr: NewOutput(),
        env:    &Env{},
    }
    if eng.Logging {
        job.Stderr.Add(utils.NopWriteCloser(eng.Stderr))
    }
    // Catchall is shadowed by specific Register.
    if handler, exists := eng.handlers[name]; exists {
        job.handler = handler
    } else if eng.catchall != nil && name != "" {
        // empty job names are illegal, catchall or not.
        job.handler = eng.catchall
    }
    return job
} 
  1. Create a job object of type Job. Its Eng field is the caller eng, its Name field is "acceptconnections", and no arguments are passed.
  2. Look up the key "acceptconnections" in eng's handlers field. remote(eng), called while loading the builtins, already registered an entry with key "acceptconnections" and value apiserver.AcceptConnections.
  3. The job's handler is therefore apiserver.AcceptConnections.
  4. Return the initialized job.
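
The handler lookup at the end of Job() has one subtlety: a handler registered under the exact name always wins, and the catchall is used only for non-empty names with no specific registration. A reduced sketch of just that rule, with a simplified Handler signature chosen for the example:

```go
package main

import "fmt"

// Handler is simplified here to take arguments and return an exit code.
type Handler func(args []string) int

type Engine struct {
	handlers map[string]Handler
	catchall Handler
}

// lookup returns the handler for name. A specific registration shadows
// the catchall, and empty names never match. It returns nil if nothing
// applies.
func (eng *Engine) lookup(name string) Handler {
	if h, ok := eng.handlers[name]; ok {
		return h
	}
	if eng.catchall != nil && name != "" {
		return eng.catchall
	}
	return nil
}

func main() {
	eng := &Engine{
		handlers: map[string]Handler{
			"acceptconnections": func([]string) int { return 0 },
		},
		catchall: func([]string) int { return 1 },
	}
	fmt.Println(eng.lookup("acceptconnections")(nil))
	fmt.Println(eng.lookup("unknown")(nil))
	fmt.Println(eng.lookup("") == nil)
}
```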

Once the job has been created, its Run() function is called.

/engine/job.go

// A job is the fundamental unit of work in the docker engine.
// Everything docker can do should eventually be exposed as a job.
// For example: execute a process in a container, create a new container,
// download an archive from the internet, serve the http api, etc.
//
// The job API is designed after unix processes: a job has a name, arguments,
// environment variables, standard streams for input, output and error, and
// an exit status which can indicate success (0) or error (anything else).
//
// One slight variation is that jobs report their status as a string. The
// string "0" indicates success, and any other strings indicates an error.
// This allows for richer error reporting.
//
type Job struct {
    Eng     *Engine
    Name    string
    Args    []string
    env     *Env
    Stdout  *Output
    Stderr  *Output
    Stdin   *Input
    handler Handler
    status  Status
    end     time.Time
}
type Status int
const (
    StatusOK       Status = 0
    StatusErr      Status = 1
    StatusNotFound Status = 127
)
// Run executes the job and blocks until the job completes.
// If the job returns a failure status, an error is returned
// which includes the status.
func (job *Job) Run() error {
    if job.Eng.IsShutdown() {
        return fmt.Errorf("engine is shutdown")
    }
    // FIXME: this is a temporary workaround to avoid Engine.Shutdown
    // waiting 5 seconds for server/api.ServeApi to complete (which it never will)
    // everytime the daemon is cleanly restarted.
    // The permanent fix is to implement Job.Stop and Job.OnStop so that
    // ServeApi can cooperate and terminate cleanly.
    if job.Name != "serveapi" {
        job.Eng.l.Lock()
        job.Eng.tasks.Add(1)
        job.Eng.l.Unlock()
        defer job.Eng.tasks.Done()
    }
    // FIXME: make this thread-safe
    // FIXME: implement wait
    if !job.end.IsZero() {
        return fmt.Errorf("%s: job has already completed", job.Name)
    }
    // Log beginning and end of the job
    job.Eng.Logf("+job %s", job.CallString())
    defer func() {
        job.Eng.Logf("-job %s%s", job.CallString(), job.StatusString())
    }()
    var errorMessage = bytes.NewBuffer(nil)
    job.Stderr.Add(errorMessage)
    if job.handler == nil {
        job.Errorf("%s: command not found", job.Name)
        job.status = 127
    } else {
        job.status = job.handler(job)
        job.end = time.Now()
    }
    // Wait for all background tasks to complete
    if err := job.Stdout.Close(); err != nil {
        return err
    }
    if err := job.Stderr.Close(); err != nil {
        return err
    }
    if err := job.Stdin.Close(); err != nil {
        return err
    }
    if job.status != 0 {
        return fmt.Errorf("%s", Tail(errorMessage, 1))
    }
    return nil
}

Run() is implemented in ./docker/engine/job.go. It executes the job and blocks until the job completes. For the job named "acceptconnections", the line that does the work is job.status = job.handler(job). Because job.handler is apiserver.AcceptConnections, what actually runs is job.status = apiserver.AcceptConnections(job).
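
The way Run turns a handler's Status into a Go error can be modelled as below. This is a reduced sketch rather than the Docker code: it drops the streams, logging and engine bookkeeping, uses a done flag where the original checks job.end, and the wording of the failure message is an assumption.

```go
package main

import "fmt"

type Status int

const (
	StatusOK       Status = 0
	StatusErr      Status = 1
	StatusNotFound Status = 127
)

type Job struct {
	Name    string
	handler func(*Job) Status
	status  Status
	done    bool
}

// Run calls the handler synchronously and maps its status to an error.
// A job can only be run once.
func (job *Job) Run() error {
	if job.done {
		return fmt.Errorf("%s: job has already completed", job.Name)
	}
	if job.handler == nil {
		job.status = StatusNotFound
		return fmt.Errorf("%s: command not found", job.Name)
	}
	job.status = job.handler(job)
	job.done = true
	if job.status != StatusOK {
		return fmt.Errorf("%s: failed with status %d", job.Name, job.status)
	}
	return nil
}

func main() {
	job := &Job{Name: "acceptconnections", handler: func(*Job) Status { return StatusOK }}
	fmt.Println(job.Run())
	fmt.Println((&Job{Name: "missing"}).Run())
}
```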

AcceptConnections is implemented in ./docker/api/server/server.go:

/api/server/server.go

func AcceptConnections(job *engine.Job) engine.Status {
    // Tell the init daemon we are accepting requests
    go  systemd.SdNotify("READY=1")
    if activationLock != nil {
        close(activationLock)
    }
    return engine.StatusOK
}

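The key call is go systemd.SdNotify("READY=1"), implemented in ./docker/pkg/systemd/sd_notify.go. It notifies the init daemon that Docker Daemon has finished starting. Closing activationLock then lets Docker Daemon begin accepting API requests from Docker Client.

At this point the daemon object has been loaded and started by the goroutine.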

3.7. Printing the Docker version and driver information

This step logs the Docker version together with the ExecDriver and GraphDriver settings:

/docker/daemon.go

// TODO actually have a resolved graphdriver to show?
log.Printf("docker daemon: %s %s; execdriver: %s; graphdriver: %s",
    dockerversion.VERSION,
    dockerversion.GITCOMMIT,
    daemonCfg.ExecDriver,
    daemonCfg.GraphDriver,
)

3.8. Creating and running serveapi

After logging this information, Docker Daemon immediately creates and runs the job named "serveapi", which makes Docker Daemon serve its API.

/docker/daemon.go

// Serve api
job := eng.Job("serveapi", flHosts...)
job.SetenvBool("Logging", true)
job.SetenvBool("EnableCors", *flEnableCors)
job.Setenv("Version", dockerversion.VERSION)
job.Setenv("SocketGroup", *flSocketGroup)
job.SetenvBool("Tls", *flTls)
job.SetenvBool("TlsVerify", *flTlsVerify)
job.Setenv("TlsCa", *flCa)
job.Setenv("TlsCert", *flCert)
job.Setenv("TlsKey", *flKey)
job.SetenvBool("BufferRequests", true)
if err := job.Run(); err != nil {
    log.Fatal(err)
}
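  1. Create a job named "serveapi" and assign flHosts to job.Args. flHosts supplies the protocols and addresses that Docker Daemon listens on.
  2. Docker Daemon sets a number of environment variables on the job, such as the TLS settings, and then starts the serveapi job with job.Run().

The handler registered in eng under the key "serveapi" is apiserver.ServeApi, so running the job executes apiserver.ServeApi, located in ./docker/api/server/server.go. For every protocol the user has configured, ServeApi starts a goroutine that brings up an http.Server for that protocol. See Docker Server for details.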

References:

  • 《Docker源码分析》 (Docker Source Code Analysis)

12.3.4 -

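1. How Docker Server is created

Docker Server is an important part of Docker Daemon. It receives requests from Docker Client, dispatches them according to its routing rules, and returns the results to Docker Client.

When Docker Daemon starts, the last thing mainDaemon() does is create and run the serveapi Job, which makes Docker Daemon serve its API.

The life cycle of Docker Server is:

  1. create the Docker Server Job;
  2. configure the Job's environment variables;
  3. trigger execution of the Job.

Note: the code analyzed here is Docker 1.2.0.

1.1. Creating the "serveapi" Job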

/docker/daemon.go

func mainDaemon() {
      ...
      // Serve api
      job := eng.Job("serveapi", flHosts...)
      ...
  }

Running the serveapi Job executes its handler, apiserver.ServeApi.

1.2. Configuring the Job's environment variables

/docker/daemon.go

job.SetenvBool("Logging", true)
job.SetenvBool("EnableCors", *flEnableCors)
job.Setenv("Version", dockerversion.VERSION)
job.Setenv("SocketGroup", *flSocketGroup)
 
job.SetenvBool("Tls", *flTls)
job.SetenvBool("TlsVerify", *flTlsVerify)
job.Setenv("TlsCa", *flCa)
job.Setenv("TlsCert", *flCert)
job.Setenv("TlsKey", *flKey)
job.SetenvBool("BufferRequests", true)

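Parameters reach the Job in two ways:

  • when the Job is created, the given arguments initialize its Args field directly;
  • after the Job is created, environment variables are set on it.

Environment variable   Flag variable   Default   Purpose
Logging                -               true      Enable log output for Docker containers
EnableCors             flEnableCors    false     Send CORS headers in the remote API
Version                -               -         Report the Docker version
SocketGroup            flSocketGroup   docker    Group assigned to the unix domain socket in daemon mode
Tls                    flTls           false     Use TLS
TlsVerify              flTlsVerify     false     Use TLS and verify the remote client
TlsCa                  flCa            -         Path to the CA file
TlsCert                flCert          -         Path to the TLS certificate file
TlsKey                 flKey           -         Path to the TLS key file
BufferRequests         -               true      Buffer Docker Client requests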

1.3. Running the Job

/docker/daemon.go

if err := job.Run(); err != nil {
    log.Fatal(err)
}

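Docker registered a handler on eng under the key serveapi; running the Job calls that handler's function, apiserver.ServeApi.

2. How ServeApi runs

ServeApi is the part of Docker Server that provides the API service. As a server that listens for, handles and responds to requests, it supports three kinds of endpoint: TCP, UNIX sockets, and file descriptors (fd). It loops over every protocol address Docker Daemon was given, starts a goroutine for each, and inside that goroutine sets up a server for HTTP requests.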

/api/server/server.go

// ServeApi loops through all of the protocols sent in to docker and spawns
// off a go routine to setup a serving http.Server for each.
func ServeApi(job *engine.Job) engine.Status {
    if len(job.Args) == 0 {
        return job.Errorf("usage: %s PROTO://ADDR [PROTO://ADDR ...]", job.Name)
    }
    var (
        protoAddrs = job.Args
        chErrors   = make(chan error, len(protoAddrs))
    )
    activationLock = make(chan struct{})
 
    for _, protoAddr := range protoAddrs {
        protoAddrParts := strings.SplitN(protoAddr, "://", 2)
        if len(protoAddrParts) != 2 {
            return job.Errorf("usage: %s PROTO://ADDR [PROTO://ADDR ...]", job.Name)
        }
        go func() {
            log.Infof("Listening for HTTP on %s (%s)", protoAddrParts[0], protoAddrParts[1])
            chErrors <- ListenAndServe(protoAddrParts[0], protoAddrParts[1], job)
        }()
    }
 
    for i := 0; i < len(protoAddrs); i += 1 {
        err := <-chErrors
        if err != nil {
            return job.Error(err)
        }
    }
 
    return engine.StatusOK
}

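ServeApi proceeds as follows:

  1. check the Job arguments;
  2. define the protocols and addresses Docker Server listens on, plus the channel that carries errors;
  3. iterate over the protocol addresses and start a server for each;
  4. use chErrors to coordinate the goroutines with the calling function.

2.1. Checking the Job arguments

The Job arguments, job.Args, are the flHosts slice. If its length is 0, there is no protocol or address to listen on and the arguments are invalid.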

/api/server/server.go

func ServeApi(job *engine.Job) engine.Status {
    if len(job.Args) == 0 {
        return job.Errorf("usage: %s PROTO://ADDR [PROTO://ADDR ...]", job.Name)
    }
    ...
}

2.2. Defining the listen addresses and the error channel

/api/server/server.go

var (
       protoAddrs = job.Args
       chErrors   = make(chan error, len(protoAddrs))
   )
   activationLock = make(chan struct{})

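This defines three variables: protoAddrs (the contents of flHosts), chErrors (a channel of errors) and activationLock (a channel that synchronizes the serveapi and acceptconnections jobs).

2.3. Iterating over the protocol addresses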

/api/server/server.go

for _, protoAddr := range protoAddrs {
    protoAddrParts := strings.SplitN(protoAddr, "://", 2)
    if len(protoAddrParts) != 2 {
        return job.Errorf("usage: %s PROTO://ADDR [PROTO://ADDR ...]", job.Name)
    }
    go func() {
        log.Infof("Listening for HTTP on %s (%s)", protoAddrParts[0], protoAddrParts[1])
        chErrors <- ListenAndServe(protoAddrParts[0], protoAddrParts[1], job)
    }()
}

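The loop iterates over the protocol addresses and starts a server for each one. A protocol address has the form PROTO://ADDR and is split on "://" into its protocol and its address.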
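
The parsing step on its own can be sketched as follows. splitProtoAddr is a hypothetical helper name, and the sample addresses are illustrative values, not taken from the source.

```go
package main

import (
	"fmt"
	"strings"
)

// splitProtoAddr splits "PROTO://ADDR" into its two parts.
// SplitN with n=2 keeps any later "://" inside the address part.
func splitProtoAddr(protoAddr string) (proto, addr string, err error) {
	parts := strings.SplitN(protoAddr, "://", 2)
	if len(parts) != 2 {
		return "", "", fmt.Errorf("invalid address %q: want PROTO://ADDR", protoAddr)
	}
	return parts[0], parts[1], nil
}

func main() {
	for _, s := range []string{"tcp://0.0.0.0:2375", "unix:///var/run/docker.sock", "bogus"} {
		proto, addr, err := splitProtoAddr(s)
		fmt.Println(proto, addr, err)
	}
}
```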

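2.4. Coordinating the goroutines through chErrors

The loop receives from chErrors once per protocol address. Each receive blocks until one of the ListenAndServe goroutines returns. If the received value is a non-nil error, ServeApi returns it immediately; otherwise the loop moves on to the next receive. chErrors thus lets the ListenAndServe goroutines report back to ServeApi: if a goroutine fails, ServeApi still sees the error, and the program exits.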

/api/server/server.go

for i := 0; i < len(protoAddrs); i += 1 {
    err := <-chErrors
    if err != nil {
        return job.Error(err)
    }
}
return engine.StatusOK
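
The same fan-out and collect pattern can be tried in isolation with the sketch below. serveAll and failOn are hypothetical names; the serve parameter stands in for ListenAndServe, with the difference that the stand-in returns immediately, whereas a real server blocks until it fails.

```go
package main

import (
	"errors"
	"fmt"
)

// serveAll starts one goroutine per address and then receives one result
// per address, returning the first non-nil error it sees.
func serveAll(addrs []string, serve func(addr string) error) error {
	// Buffered so that goroutines can still send after an early return.
	errs := make(chan error, len(addrs))
	for _, addr := range addrs {
		go func(addr string) {
			errs <- serve(addr)
		}(addr)
	}
	for range addrs {
		if err := <-errs; err != nil {
			return err
		}
	}
	return nil
}

// failOn returns a serve function that fails only for the address bad.
func failOn(bad string) func(string) error {
	return func(addr string) error {
		if addr == bad {
			return errors.New("cannot listen on " + addr)
		}
		return nil
	}
}

func main() {
	addrs := []string{"tcp://a", "unix://b"}
	fmt.Println(serveAll(addrs, failOn("")))
	fmt.Println(serveAll(addrs, failOn("unix://b")))
}
```

Sizing the channel buffer to the number of goroutines matters: without it, goroutines that finish after an early return would block forever on their send.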

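3. ListenAndServe implementation

ListenAndServe makes Docker Server listen on a given address, accept the requests that arrive there, and route them to the matching handler.

ListenAndServe proceeds as follows:

  1. create the router instance;
  2. create the listener instance;
  3. create the http.Server;
  4. start the API service.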

3.1. Creating the router instance

/api/server/server.go

// ListenAndServe sets up the required http.Server and gets it listening for
// each addr passed in and does protocol specific checking.
func ListenAndServe(proto, addr string, job *engine.Job) error {
    var l net.Listener
    r, err := createRouter(job.Eng, job.GetenvBool("Logging"), job.GetenvBool("EnableCors"), job.Getenv("Version"))
    if err != nil {
        return err
    }
    ...
}

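The router handles routing and dispatch of external requests for Docker Server. It is built in two steps:

  1. create a new router instance;
  2. add route entries to it.

3.1.1. Creating an empty router instance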

/api/server/server.go

func createRouter(eng *engine.Engine, logging, enableCors bool, dockerVersion string) (*mux.Router, error) {
     r := mux.NewRouter()
     ...
 }

/vendor/src/github.com/gorilla/mux/mux.go

// NewRouter returns a new router instance.
func NewRouter() *Router {
    return &Router{namedRoutes: make(map[string]*Route), KeepContext: false}
}
 
// This will send all incoming requests to the router.
type Router struct {
    // Configurable Handler to be used when no route matches.
    NotFoundHandler http.Handler
    // Parent route, if this is a subrouter.
    parent parentRoute
    // Routes to be matched, in order.
    routes []*Route
    // Routes by name for URL building.
    namedRoutes map[string]*Route
    // See Router.StrictSlash(). This defines the flag for new routes.
    strictSlash bool
    // If true, do not clear the request context after handling the request
    KeepContext bool
}

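NewRouter() returns a new router instance r of type mux.Router. It initializes namedRoutes and KeepContext:

  • namedRoutes: a map whose keys are strings and whose values are *Route entries;
  • KeepContext: false, meaning the request context is cleared after a request has been handled rather than kept.

mux.Router matches incoming requests against its registered routes. It finds the route that matches the request URL or other conditions and calls that route's handler.

Features of mux.Router:

  • requests can be matched on URL host, path, path prefix, schemes, header and query values, HTTP method, or custom matchers;
  • URL hosts and paths can be described with regular expressions;
  • registered URLs can be built, or "reversed", which helps maintain references to resources;
  • routes can also be used as subrouters.

3.1.2. Adding route entries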

/api/server/server.go

if os.Getenv("DEBUG") != "" {
       AttachProfiler(r)
   }
 
   m := map[string]map[string]HttpApiFunc{
       "GET": {
           "/_ping":                          ping,
           "/events":                         getEvents,
           "/info":                           getInfo,
           "/version":                        getVersion,
           "/images/json":                    getImagesJSON,
           "/images/viz":                     getImagesViz,
           "/images/search":                  getImagesSearch,
           "/images/{name:.*}/get":           getImagesGet,
           "/images/{name:.*}/history":       getImagesHistory,
           "/images/{name:.*}/json":          getImagesByName,
           "/containers/ps":                  getContainersJSON,
           "/containers/json":                getContainersJSON,
           "/containers/{name:.*}/export":    getContainersExport,
           "/containers/{name:.*}/changes":   getContainersChanges,
           "/containers/{name:.*}/json":      getContainersByName,
           "/containers/{name:.*}/top":       getContainersTop,
           "/containers/{name:.*}/logs":      getContainersLogs,
           "/containers/{name:.*}/attach/ws": wsContainersAttach,
       },
       "POST": {
           "/auth":                         postAuth,
           "/commit":                       postCommit,
           "/build":                        postBuild,
           "/images/create":                postImagesCreate,
           "/images/load":                  postImagesLoad,
           "/images/{name:.*}/push":        postImagesPush,
           "/images/{name:.*}/tag":         postImagesTag,
           "/containers/create":            postContainersCreate,
           "/containers/{name:.*}/kill":    postContainersKill,
           "/containers/{name:.*}/pause":   postContainersPause,
           "/containers/{name:.*}/unpause": postContainersUnpause,
           "/containers/{name:.*}/restart": postContainersRestart,
           "/containers/{name:.*}/start":   postContainersStart,
           "/containers/{name:.*}/stop":    postContainersStop,
           "/containers/{name:.*}/wait":    postContainersWait,
           "/containers/{name:.*}/resize":  postContainersResize,
           "/containers/{name:.*}/attach":  postContainersAttach,
           "/containers/{name:.*}/copy":    postContainersCopy,
       },
       "DELETE": {
           "/containers/{name:.*}": deleteContainers,
           "/images/{name:.*}":     deleteImages,
       },
       "OPTIONS": {
           "": optionsHandler,
       },
   }

m is a map. Its keys are HTTP methods such as GET, POST and DELETE; its values are themselves maps from URL pattern to handler function.
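
To show how such a two-level table drives dispatch, the sketch below builds an http.Handler from a method-to-path-to-function map using only the standard library. It is not how Docker wires the table (Docker registers each entry on a gorilla/mux router, which also handles patterns such as {name:.*}); this version matches exact paths only, and apiFunc, newHandler, statusFor and demoRoutes are hypothetical names.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// apiFunc is a simplified handler signature that may return an error.
type apiFunc func(w http.ResponseWriter, r *http.Request) error

// newHandler dispatches on method first, then on the exact URL path.
func newHandler(routes map[string]map[string]apiFunc) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		f, ok := routes[r.Method][r.URL.Path]
		if !ok {
			http.NotFound(w, r)
			return
		}
		if err := f(w, r); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
		}
	})
}

// statusFor sends an in-memory request to h and returns the status code.
func statusFor(h http.Handler, method, path string) int {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, nil))
	return rec.Code
}

// demoRoutes returns a one-entry table for illustration.
func demoRoutes() map[string]map[string]apiFunc {
	return map[string]map[string]apiFunc{
		"GET": {
			"/_ping": func(w http.ResponseWriter, r *http.Request) error {
				_, err := w.Write([]byte("OK"))
				return err
			},
		},
	}
}

func main() {
	h := newHandler(demoRoutes())
	fmt.Println(statusFor(h, "GET", "/_ping"), statusFor(h, "POST", "/_ping"))
}
```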

/api/server/server.go

type HttpApiFunc func(eng *engine.Engine, version version.Version, w http.ResponseWriter, r *http.Request, vars map[string]string) error

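3.2. Creating the listener instance

The router handles routing and dispatch of requests; the listener handles accepting them. A Listener is a generic network listener for stream-oriented protocols.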

/api/server/server.go

var l net.Listener
 ...
 if job.GetenvBool("BufferRequests") {
     l, err = listenbuffer.NewListenBuffer(proto, addr, activationLock)
 } else {
     l, err = net.Listen(proto, addr)
 }

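What ListenBuffer does: it lets Docker Server start listening on the given protocol and address immediately, but holds incoming requests in a buffer. Docker Server only begins accepting them once Docker Daemon has finished starting up.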

/pkg/listenbuffer/buffer.go

// NewListenBuffer returns a listener listening on addr with the protocol.
func NewListenBuffer(proto, addr string, activate chan struct{}) (net.Listener, error) {
    wrapped, err := net.Listen(proto, addr)
    if err != nil {
        return nil, err
    }
 
    return &defaultListener{
        wrapped:  wrapped,
        activate: activate,
    }, nil
}
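The buffering idea can be sketched as a listener whose Accept blocks until an activation channel is closed. This is an illustration under assumptions, not the listenbuffer package itself: gatedListener, newGatedListener and demo are hypothetical names, and the "buffer" here is simply the kernel's accept queue.

```go
package main

import (
	"fmt"
	"net"
	"time"
)

// gatedListener binds the address right away but refuses to hand out
// connections until activate is closed.
type gatedListener struct {
	net.Listener                 // the real listener, already bound
	activate     <-chan struct{} // closed once the daemon is ready
}

// Accept waits for activation, then delegates to the wrapped listener.
func (g *gatedListener) Accept() (net.Conn, error) {
	<-g.activate
	return g.Listener.Accept()
}

func newGatedListener(proto, addr string, activate <-chan struct{}) (net.Listener, error) {
	wrapped, err := net.Listen(proto, addr)
	if err != nil {
		return nil, err
	}
	return &gatedListener{Listener: wrapped, activate: activate}, nil
}

// demo reports whether a pending connection was accepted before and after activation.
func demo() (bool, bool, error) {
	activate := make(chan struct{})
	l, err := newGatedListener("tcp", "127.0.0.1:0", activate)
	if err != nil {
		return false, false, err
	}
	defer l.Close()

	accepted := make(chan struct{})
	go func() {
		if conn, err := l.Accept(); err == nil {
			conn.Close()
			close(accepted)
		}
	}()

	// The dial succeeds at once: the kernel queues the connection.
	client, err := net.Dial("tcp", l.Addr().String())
	if err != nil {
		close(activate)
		return false, false, err
	}
	defer client.Close()

	before := false
	select {
	case <-accepted:
		before = true
	case <-time.After(100 * time.Millisecond):
	}

	close(activate) // "the daemon has finished starting"
	after := false
	select {
	case <-accepted:
		after = true
	case <-time.After(2 * time.Second):
	}
	return before, after, nil
}

func main() {
	before, after, err := demo()
	fmt.Println(before, after, err)
}
```

Clients can connect as soon as the address is bound, so early requests are delayed rather than refused.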

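If the protocol is TCP and either Tls or TlsVerify is true in the Job environment, Docker Server has to serve HTTPS. In that case it builds a tls.Config instance named tlsConfig, loads the certificates and authentication settings into it, and wraps the listener with the NewListener function from the tls package to obtain a Listener for HTTPS requests.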

/api/server/server.go

l = tls.NewListener(l, tlsConfig)

3.3. Creating the http.Server

/api/server/server.go

httpSrv := http.Server{Addr: addr, Handler: r}

Docker Server needs a Server object to run the HTTP/HTTPS service. It creates an http.Server in which addr is the address to listen on and r is the mux.Router.

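3.4. Starting the API service

Once the http.Server instance exists, the API service starts: it listens for requests and spawns a new goroutine dedicated to each one. The goroutine reads the request, looks it up in the routing table, calls the handler of the matching route, and returns the response when the handler finishes.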

/api/server/server.go

return httpSrv.Serve(l)
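A minimal sketch of the last two steps, assuming a plain http.ServeMux in place of mux.Router: create a listener, hand it to http.Server.Serve, and issue one request against it. serveOnce and the /_ping handler are illustrative names chosen for this sketch.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
)

// serveOnce starts a server on an ephemeral port, sends one request to it,
// shuts the server down and returns the response body.
func serveOnce() (string, error) {
	l, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return "", err
	}

	mux := http.NewServeMux()
	mux.HandleFunc("/_ping", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "OK")
	})
	srv := &http.Server{Handler: mux}

	// Serve blocks until the server stops, so it runs in its own goroutine;
	// net/http then serves each accepted connection in a goroutine of its own.
	done := make(chan error, 1)
	go func() { done <- srv.Serve(l) }()

	resp, err := http.Get("http://" + l.Addr().String() + "/_ping")
	if err != nil {
		srv.Close()
		return "", err
	}
	body, err := io.ReadAll(resp.Body)
	resp.Body.Close()
	if err != nil {
		srv.Close()
		return "", err
	}

	if err := srv.Shutdown(context.Background()); err != nil {
		return "", err
	}
	if err := <-done; err != http.ErrServerClosed {
		return "", err
	}
	return string(body), nil
}

func main() {
	body, err := serveOnce()
	fmt.Println(body, err)
}
```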

Reference:

  • 《Docker源码分析》

12.3.5 -

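1. Docker overall architecture

Docker uses a client/server (C/S) architecture. The backend is loosely coupled, with each module responsible for its own task.

  1. The user talks to Docker Daemon through Docker Client, which sends requests to the daemon.
  2. Docker Daemon is the core of the architecture. It first acts as a server so that it can accept requests from Docker Client.
  3. Engine carries out Docker's internal work; every piece of work exists as a Job.
  4. When a Job needs a container image, it downloads the image from Docker Registry, and the image management driver graphdriver stores the downloaded image in the form of a Graph.
  5. When a network environment has to be created for Docker, the network management driver networkdriver creates and configures the container network.
  6. When container resources must be limited or user commands executed, the work is done by execdriver.
  7. libcontainer is an independent container management package; both networkdriver and execdriver rely on it to carry out the actual operations on containers.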

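2. Docker modules and components

2.1. Docker Client [issues requests]

  1. Docker Client is the client that communicates with Docker Daemon. The executable is docker, and a complete request is the docker command followed by arguments (for example docker images, where docker is the fixed command and images is the variable argument).
  2. Docker Client can reach Docker Daemon in three ways: tcp://host:port, unix://path_to_socket and fd://socketfd.
  3. After Docker Client sends a container management request, Docker Daemon accepts and processes it. Once Docker Client has received the response and done some simple processing, its life cycle for that request ends. [A complete request: send request → process request → return result], the same flow as in any traditional C/S architecture.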

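2.2. Docker Daemon [background daemon process]

Figure: Docker Daemon architecture

2.2.1. Docker Server [schedules and dispatches requests]

Figure: Docker Server architecture

  1. Docker Server is the server side of the C/S architecture. It accepts the requests sent by Docker Client and dispatches them: after accepting a request, it uses routing and dispatching to find the matching Handler and executes it.
  2. During startup, Docker uses the gorilla/mux package to create a mux.Router that provides request routing. In Golang, gorilla/mux is a powerful URL router and dispatcher. Many route entries are added to the mux.Router; each entry consists of an HTTP method (PUT, POST, GET or DELETE), a URL and a Handler.
  3. After creating the mux.Router, Docker passes the server's listen address and the mux.Router as parameters to create httpSrv=http.Server{}, and finally calls httpSrv.Serve() to serve requests.
  4. While serving, the Server accepts Docker Client requests on the listener and creates a new goroutine for each one. The goroutine reads the request, parses it, finds the matching route entry, calls the corresponding Handler, and the Handler replies once it has finished processing.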

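2.2.2. Engine

  1. Engine is the execution engine of the Docker architecture and the core module of a running Docker. It acts as the store for Docker containers and manages them by running jobs.
  2. The Engine data structure contains a handler object that stores the handler for each specific job. For example, an entry {"create": daemon.ContainerCreate,} means that when the job named "create" runs, the handler executed is daemon.ContainerCreate.

2.2.3. Job

  1. A Job is the most basic unit of work inside Engine. Everything Docker can do can be modelled as a job: running a process inside a container is a job, and creating a new container is a job. Running Docker Server is also a job, named serveapi.
  2. Jobs were designed to resemble Unix processes: a Job has a name, arguments, environment variables, standard input and output, error handling, a return status, and so on.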
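The relationship between Engine, its handler table and Jobs can be sketched as follows. This is a simplified model under assumptions, not Docker's engine package: the engine, job, handler and status types and their methods are hypothetical names.

```go
package main

import (
	"errors"
	"fmt"
)

// status mirrors a process exit code: 0 means success.
type status int

// job resembles a Unix process: it has a name, arguments and environment variables.
type job struct {
	name string
	args []string
	env  map[string]string
}

// handler is the function executed when a job with a given name runs.
type handler func(j *job) status

// engine keeps the table that maps job names to handlers.
type engine struct {
	handlers map[string]handler
}

func newEngine() *engine {
	return &engine{handlers: map[string]handler{}}
}

// register binds a handler to a job name; a name can be bound only once.
func (e *engine) register(name string, h handler) error {
	if _, exists := e.handlers[name]; exists {
		return fmt.Errorf("handler already registered for job %q", name)
	}
	e.handlers[name] = h
	return nil
}

// run looks up the handler for the job's name and executes it.
func (e *engine) run(j *job) (status, error) {
	h, ok := e.handlers[j.name]
	if !ok {
		return 127, errors.New("no handler for job " + j.name)
	}
	return h(j), nil
}

func main() {
	eng := newEngine()
	// Plays the role of the {"create": daemon.ContainerCreate} entry.
	eng.register("create", func(j *job) status {
		fmt.Println("creating container from", j.args)
		return 0
	})
	st, err := eng.run(&job{name: "create", args: []string{"ubuntu"}})
	fmt.Println(st, err)
}
```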

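2.3. Docker Registry [image registry]

  1. Docker Registry is a repository (registry) that stores container images and can be thought of as a cloud-side image store. Images are grouped by repository, and docker pull identifies exactly one image by [repository]:[tag].
  2. While running, Docker Daemon talks to Docker Registry to search, download and upload images; the corresponding job names are "search", "pull" and "push".
  3. Registries are either public (Docker Hub) or private.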

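2.4. Graph [Docker's internal database]

Figure: Graph architecture

2.4.1. Repository

  1. The keeper of downloaded images (both pulled images and images built from a Dockerfile).
  2. A repository holds one kind of image (for example Ubuntu); images in the same repository are told apart by tag (different labels or versions of the same kind of image). A registry contains many repositories, and a repository contains many images of the same kind.
  3. Image storage types include aufs, devicemapper, Btrfs and Vfs. CentOS systems use devicemapper.
  4. In Graph's local directory, the information stored for each container image is: the image metadata, the image size, and the concrete rootfs that the image represents.

2.4.2. GraphDB

  1. The recorder of the relationships between downloaded container images.
  2. GraphDB is a small graph database built on SQLite. It implements node naming and records the relationships between nodes.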

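2.5. Driver [execution]

Driver is the driver module of the Docker architecture. Through drivers, Docker can customize the execution environment of containers. In short, Graph is responsible for storing images and Driver is responsible for running containers.

2.5.1. graphdriver

Figure: graphdriver architecture

  1. graphdriver manages container images, which covers storing and retrieving them.
  2. Storing: images downloaded by docker pull are stored by graphdriver in the designated local directory (in Graph).
  3. Retrieving: when docker run (create) creates a container from an image, graphdriver fetches the image from the local Graph.

2.5.2. networkdriver

Figure: networkdriver architecture

  1. networkdriver configures the container network environment, which includes:
    • creating a bridge for the Docker environment when Docker starts;
    • creating a dedicated virtual network interface for each container when it is created;
    • allocating an IP address and ports to the container, mapping ports to the host, and setting the container's firewall policy.

2.5.3. execdriver

Figure: execdriver architecture

  1. execdriver is the execution driver for Docker containers. It creates the namespaces a container runs in, accounts for and limits the container's resource usage, and actually runs the processes inside the container.
  2. execdriver now uses the native driver by default and no longer depends on LXC.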

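2.6. libcontainer [library]

Figure: libcontainer architecture

  1. libcontainer is a library in the Docker architecture, designed and implemented in Go. It was designed to access the container-related kernel APIs directly, without relying on any other dependency.
  2. Docker can call libcontainer directly and, through it, manipulate a container's namespaces, cgroups, apparmor profile, network devices and firewall rules.
  3. libcontainer offers a complete set of standard interfaces for the container management needs of the upper layers. Put differently, libcontainer shields Docker's upper layers from managing containers directly.

2.7. docker container [the final form of service delivery]

Figure: container architecture

  1. A Docker container is the final form in which the Docker architecture delivers a service.

  2. Docker customizes the container according to the user's requirements and instructions:

    • by choosing a container image, the user customizes the rootfs and other file systems of the container;
    • by setting compute resource quotas, the user limits the container to the specified resources;
    • by configuring the network and its security policy, the user gives the container an independent and secure network environment;
    • by specifying the command to run, the user makes the container do the intended work.

Reference: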

  • 《Docker源码分析》

12.3.6 -

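1. Basic concepts

1.1. image layer

An image can be viewed as a file system built by stacking image layers. An image layer can itself be thought of as a basic image, and layers are stacked by having each one point to its parent.

An image layer consists mainly of the layer id, a pointer to its parent layer, and metadata [layer metadata], which holds the information Docker needs to build and run the layer plus the hierarchy information of its parent layers.

A read-only layer and a read-write layer [top layer] have essentially the same structure, and a read-write layer can be turned into a read-only layer [via docker commit].

1.2. image [a set of read-only layers]

An image is a unified view over a stack of read-only layers. Every layer except the bottom one points to its parent. A union file system merges the layers into a single file system and presents one unified view, hiding the fact that there are multiple layers: the user sees only one file system. None of the layers is writable; they are all read-only.

1.3. container [one read-write layer + multiple read-only layers]

The difference between a container and an image is that the top layer of a container is a read-write layer [top layer]. This definition does not say whether the container is running. A running container is a readable and writable file system [static container] plus an isolated process space and the processes inside it.

Processes in the isolated process space can add, delete and modify files in the read-write layer; every operation performed by processes of a running container lands in that layer. Each container has exactly one isolated process space.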
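The layer model above can be sketched as a linked list of layers, where reads fall through to parent layers and writes land only in a writable top layer. This is a toy in-memory model under assumptions, not Docker's storage code: the layer type and the newLayer, read, write and commit names are hypothetical, and file deletion is left out for brevity.

```go
package main

import "fmt"

// layer holds the files changed in one layer and a pointer to its parent.
type layer struct {
	id       string
	parent   *layer            // nil for the bottom layer
	files    map[string]string // path -> content written in this layer
	readOnly bool
}

func newLayer(id string, parent *layer) *layer {
	return &layer{id: id, parent: parent, files: map[string]string{}}
}

// read resolves a path through the unified view: the topmost layer that
// contains the path wins, and lower layers stay hidden.
func (l *layer) read(path string) (string, bool) {
	for cur := l; cur != nil; cur = cur.parent {
		if content, ok := cur.files[path]; ok {
			return content, true
		}
	}
	return "", false
}

// write only touches this layer and fails once the layer is read-only.
func (l *layer) write(path, content string) error {
	if l.readOnly {
		return fmt.Errorf("layer %s is read-only", l.id)
	}
	l.files[path] = content
	return nil
}

// commit freezes the layer, turning a read-write layer into a read-only one.
func (l *layer) commit() *layer {
	l.readOnly = true
	return l
}

func main() {
	base := newLayer("base", nil)
	base.write("/etc/os-release", "base-os")
	base.commit() // an image with one read-only layer

	top := newLayer("top", base) // a container adds a read-write layer
	top.write("/etc/os-release", "patched")

	fromTop, _ := top.read("/etc/os-release")
	fromBase, _ := base.read("/etc/os-release")
	fmt.Println(fromTop, fromBase)
}
```

The lower layer is never modified, which is why many containers can share the same image layers.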

3.2

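2. Overview diagram of common Docker commands

3. Common Docker commands

3.1. Legend

3.1.1. image (unified read-only file system)

3.1.2. Static container [not running] (unified read-write file system)

3.1.3. Dynamic container [running container] (process space, including its processes, + unified read-write file system)

3.2. Commands

3.2.1. Life cycle commands

3.2.1.1. docker create {image-id}

Adds a read-write layer [top layer] to the read-only file system, producing a read-write file system. The container is then a static container: it exists but is not running.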

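3.2.1.2. docker start (restart) {container-id}

docker stop is the inverse of docker start.

Adds a process space [including its processes] to the read-write file system, producing a dynamic container [running container].

3.2.1.3. docker run {image-id}

docker run = docker create + docker start

3.2.1.4. docker stop {container-id}

Sends SIGTERM to the running container and then stops all of its processes. It is the inverse of docker start.

3.2.1.5. docker kill {container-id}

docker kill sends the unfriendly SIGKILL signal, which forces the container to shut down immediately. docker stop, in contrast, shuts down gracefully: it sends SIGTERM first so that processes can clean up, and sends SIGKILL only if the container is still running after the grace period (10 seconds by default).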

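3.2.1.6. docker pause {container-id}

Suspends all processes in the container, one after another, using the cgroup freezer. docker unpause is the inverse and resumes them. Both are rarely used.

3.2.1.7. docker commit {container-id}

Turns the container's read-write layer into a read-only layer, so the container [read-write file system] becomes an image [read-only file system]. Think of it as freezing the container's state.

3.2.1.8. docker build

docker build = docker run [run a container] + [processes modify data] + docker commit [freeze the data], repeated until the required image is produced.

Each iteration produces a new layer (image) [the previous image layers + the newly frozen read-write layer].

docker build usually works on a Dockerfile.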

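3.2.2. Query commands

Objects queried: (1) images, (2) containers, (3) data inside an image or container, (4) system information [number of containers, number of images, and so on]

3.2.2.1. Image

1. docker images

docker images lists the current images [each complete image is represented by the id of its top layer]; several image layers are hidden beneath each top layer.

2. docker images -a

docker images -a lists all image layers: for each image, the top-layer id comes first, followed by all the layers beneath it.

3. docker history {image-id}

docker history lists all the historical layers under the given image id.

3.2.2.2. Container

1. docker ps

Lists all running containers [running container].

2. docker ps -a

Lists all containers, both static [not running] and dynamic [running container].

3.2.2.3. Info

1. docker inspect {container-id} or {image-id}

Extracts the metadata of the top layer of the container or image.

2. docker info

Displays Docker system information, including the number of images and containers.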

3.2.3. Operation commands

3.2.3.1. docker rm {container-id}

docker rm removes a container. It only works on static containers [not running].

With the -f (force) flag, docker rm -f {container-id} removes a container even while it is running [running container].

3.2.3.2. docker rmi {image-id}

3.2.3.3. docker exec {running-container-id}

docker exec starts a new process inside a running container.

3.2.3.4. docker export {container-id}

docker export creates a tar file. It drops the metadata and the unneeded layers and merges the layers into one, keeping only the content visible through the current unified view.


12.3.7 -

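1. About the Dockerfile

Dockerfile instructions are case-insensitive, but upper case is recommended. # starts a comment. Each line holds one instruction, and an instruction can take several arguments.

Dockerfile instructions fall into build instructions and setting instructions.

  1. Build instructions: used to build the image; the operations they specify are not executed in containers that run the image.
  2. Setting instructions: used to set image properties; the operations they specify are executed in containers that run the image.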

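2. Dockerfile instructions

2.1. FROM (specify the base image) [build instruction]

Specifies the base image; the new image is built by modifying data on top of it. The base image can come from the local repository or from a remote one.

The instruction has two forms:

  1. FROM image [defaults to the latest tag]
  2. FROM image:tag [a specific version]

2.2. MAINTAINER (image author information) [build instruction]

Writes the information about the image author (maintainer) into the image; docker inspect prints it.

Format: MAINTAINER name

MAINTAINER is deprecated; use a maintainer label instead.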

LABEL maintainer="SvenDowideit@home.org.au"

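2.3. RUN (used to install software) [build instruction]

RUN executes any command supported by the base image (that is, it runs a process on top of the base image). Multiple RUN instructions are allowed, and a long instruction can be continued on the next line with \.

The instruction has two forms:

  1. RUN command (the command is run in a shell - /bin/sh -c)
  2. RUN ["executable", "param1", "param2" ... ] (exec form)
    • Runs the executable directly via exec, which also lets you choose a different shell.
    • Example: RUN ["/bin/bash","-c","echo hello"]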

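2.4. CMD (set what runs when the container starts) [setting instruction]

Specifies what runs when the container starts; it can be a custom script or a command. Only one CMD takes effect: if several are given, the last one wins.

The instruction has three forms:

  1. CMD ["executable","param1","param2"] (like an exec, this is the preferred form)
    • Runs an executable with the given arguments.
  2. CMD command param1 param2 (as a shell)
    • Runs a shell command, executed with /bin/sh -c by default.
  3. CMD ["param1","param2"] (as default parameters to ENTRYPOINT)
    • Used together with ENTRYPOINT; it only supplies the argument part of the full command.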

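2.5. ENTRYPOINT (set what runs when the container starts) [setting instruction]

Specifies the command executed when the container starts. If it is set several times, only the last one takes effect.

ENTRYPOINT means "entry point": it lets a container behave like an executable program.

Example: with ENTRYPOINT ["/bin/echo"], containers started from the built image behave like the /bin/echo program. If the image is named imageecho, docker run -it imageecho "this is a test" prints that string, so the container acts just like echo.

The instruction has two forms:

  1. ENTRYPOINT ["executable", "param1", "param2"] (like an exec, the preferred form)

    • Used together with CMD: ENTRYPOINT gives the command part in JSON array form, and CMD supplies the argument part of the full command. Put the arguments that vary in CMD and the fixed ones in ENTRYPOINT.

    • Example:

      ENTRYPOINT ["/usr/bin/ls","-a"]

      CMD ["-l"]

  2. ENTRYPOINT command param1 param2 (as a shell)

    • Used on its own. The command runs under /bin/sh -c, and in this form any CMD or docker run arguments are ignored.
    • Example: ENTRYPOINT ls -l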

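2.6. USER (set the user the container runs as) [setting instruction]

Sets the user that the container starts with; the default is root.

Format: USER daemon

2.7. EXPOSE (declare the ports the container listens on) [setting instruction]

EXPOSE declares the ports the container listens on; it does not publish them by itself. To reach the container through the host IP and a fixed port (which avoids depending on the container's changing IP address), publish the port with -p when running the container, making sure the host port is not already in use, or use -P to publish every EXPOSEd port on a random host port. EXPOSE accepts several ports; use one -p per port you publish. docker port followed by the container ID and the container port shows the host port it is mapped to.

Format: EXPOSE port [port...]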

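2.8. ENV (set environment variables) [build instruction]

Sets environment variables in the image [as key-value pairs]. Subsequent RUN instructions can use them. After the container starts, they can be inspected with docker inspect, and set or overridden with docker run --env key=value.

Format: ENV key value

Example: ENV JAVA_HOME /path/to/java/dirent

2.9. ARG (define build variables) [build instruction]

ARG defines a variable, optionally with a default value, that can be referenced in the Dockerfile. At build time a value can be passed in with docker build --build-arg <arg_name>=<value>.

ARG <arg_name>[=<default value>]
# Can be combined with ENV
ENV env_name ${arg_name}

Example: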

docker build --build-arg user=what_user .

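2.10. ADD (copy files from src to dest in the container) [build instruction]

Copies src into the container at dest. src is a path relative to the build context (a file or a directory) or a remote file URL. dest is an absolute path inside the container. All files and directories copied into the container have mode 0755 and uid and gid 0.

  • If src is a directory, all files under it are added to the container, but not the directory itself;
  • If src is an archive in a recognized compression format, Docker unpacks it (mind the format);
  • If src is a file and dest does not end with a slash, dest is treated as a file and the content of src is written to it;
  • If src is a file and dest ends with a slash, src is copied into the dest directory.

Format: ADD src dest

To avoid the hidden risks and complexity of ADD, prefer COPY.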
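The two trailing-slash rules for a single source file can be sketched as a path computation. addTarget is a hypothetical helper for illustration; it covers only the case where src is a regular file and ignores directories, archives and URLs.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// addTarget returns the path a single source file ends up at inside the
// container, following the trailing-slash rules for dest.
func addTarget(src, dest string) string {
	if strings.HasSuffix(dest, "/") {
		// dest is a directory: keep the file name of src.
		return path.Join(dest, path.Base(src))
	}
	// dest is a file: the content of src is written to it.
	return dest
}

func main() {
	fmt.Println(addTarget("conf/app_conf.py", "/usr/local/app/"))
	fmt.Println(addTarget("conf/app_conf.py", "/usr/local/app/settings.py"))
}
```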

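2.11. COPY (copy files)

Copies src from the local host to dest in the container. The target path is created automatically if it does not exist.

Format: COPY src dest

2.12. VOLUME (specify a mount point) [setting instruction]

Creates a mount point that can be mounted from the local host or from other containers, so that a directory in the container can store data persistently. The directory can be used by the container itself and by other containers.

Format: VOLUME ["mountpoint"]

Sharing the data volume with another container: docker run -t -i --rm --volumes-from container1 image2 bash [container1 is the ID of the first container; image2 is the name of the image the second container runs.]

2.13. WORKDIR (change directory) [setting instruction]

Works like the cd command and sets the working directory for RUN, CMD and ENTRYPOINT. WORKDIR can be used several times; a relative path is resolved against the previous WORKDIR [just as with cd].

Format: WORKDIR /path/to/workdir

2.14. ONBUILD (run in child images)

Specifies an operation that runs when this image is used as the base image of another image. It does not run while this image is being built; it runs during the build of any image based on it (treat the base image as the parent: the instruction runs when a child image is built, as if extra instructions had been added to the child's build).

Format: ONBUILD <Dockerfile instruction>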

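3. Dockerfile example

Best practices

  • Organize images in three tiers: system base image, business base image, business image.
  • Put the operations that rarely change near the top of the Dockerfile.
  • Combine related RUN operations into a single RUN instruction with && and line continuations (\).
  • Keep the Dockerfile as clear and concise as possible.

Directory layout

./
|-- Dockerfile
|-- docker-entrypoint.sh
|-- dumb-init
|-- conf    # configuration files
|   `-- app_conf.py  
|-- pkg   # installation packages
|   `-- install.tar.gz
|-- run.sh  # startup script

Dockerfile example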

FROM centos:latest
LABEL maintainer="xxx@xxx.com"

ARG APP=appname
ENV APP ${APP}

# copy and install app 
COPY conf/app_conf.py /usr/local/app/app_conf/app_conf.py
COPY pkg/${APP}-*-install.tar.gz /data/${APP}-install.tar.gz
RUN mkdir -p /data/${APP} \
    && tar -zxvf /data/${APP}-install.tar.gz -C /data/${APP} \
    && cd /data/${APP}/${APP}* \
    && ./install.sh

WORKDIR /usr/local/app/

# init
COPY dumb-init /usr/bin/dumb-init
COPY docker-entrypoint.sh /docker-entrypoint.sh
ENTRYPOINT ["/usr/bin/dumb-init", "--","/docker-entrypoint.sh"]

COPY run.sh /run.sh
RUN chmod +x /run.sh
CMD ["/run.sh"]

4. docker build

Building with a specific Dockerfile

If no Dockerfile name is given, docker build reads the file named Dockerfile in the given path.

docker build -t <image_name> -f <dockerfile_name> <dockerfile_path>

docker build --help

docker build --help

Usage:	docker build [OPTIONS] PATH | URL | -

Build an image from a Dockerfile

Options:
      --add-host list           Add a custom host-to-IP mapping (host:ip)
      --build-arg list          Set build-time variables
      --cache-from strings      Images to consider as cache sources
      --cgroup-parent string    Optional parent cgroup for the container
      --compress                Compress the build context using gzip
      --cpu-period int          Limit the CPU CFS (Completely Fair Scheduler) period
      --cpu-quota int           Limit the CPU CFS (Completely Fair Scheduler) quota
  -c, --cpu-shares int          CPU shares (relative weight)
      --cpuset-cpus string      CPUs in which to allow execution (0-3, 0,1)
      --cpuset-mems string      MEMs in which to allow execution (0-3, 0,1)
      --disable-content-trust   Skip image verification (default true)
  -f, --file string             Name of the Dockerfile (Default is 'PATH/Dockerfile')
      --force-rm                Always remove intermediate containers
      --iidfile string          Write the image ID to the file
      --isolation string        Container isolation technology
      --label list              Set metadata for an image
  -m, --memory bytes            Memory limit
      --memory-swap bytes       Swap limit equal to memory plus swap: '-1' to enable unlimited swap
      --network string          Set the networking mode for the RUN instructions during build (default "default")
      --no-cache                Do not use cache when building the image
      --pull                    Always attempt to pull a newer version of the image
  -q, --quiet                   Suppress the build output and print image ID on success
      --rm                      Remove intermediate containers after a successful build (default true)
      --security-opt strings    Security options
      --shm-size bytes          Size of /dev/shm
  -t, --tag list                Name and optionally a tag in the 'name:tag' format
      --target string           Set the target build stage to build.
      --ulimit ulimit           Ulimit options (default [])


12.3.8 -

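1. Installing Docker on CentOS

CentOS 7 is recommended.

1.1. Installing Docker

1.1.1. Removing old versions

Older versions of Docker were named docker or docker-engine. If one is installed, remove it first.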

$ sudo yum remove -y docker \
                  docker-client \
                  docker-client-latest \
                  docker-common \
                  docker-latest \
                  docker-latest-logrotate \
                  docker-logrotate \
                  docker-selinux \
                  docker-engine-selinux \
                  docker-engine

1.1.2. Installing from the repository

1. Install yum-utils, device-mapper-persistent-data and lvm2

$ sudo yum install -y yum-utils \
  device-mapper-persistent-data \
  lvm2

2. Add the package repository

$ sudo yum-config-manager \
    --add-repo \
    https://download.docker.com/linux/centos/docker-ce.repo

1.1.3. Installing Docker

Install the latest version of Docker CE.

$ sudo yum install -y docker-ce 

1.1.4. Starting Docker

# Start Docker
$ sudo systemctl start docker
# Run a container
$ sudo docker run hello-world

1.2. Installing a specific Docker version

1. List the available versions

$ yum list docker-ce --showduplicates | sort -r

docker-ce.x86_64            18.03.0.ce-1.el7.centos             docker-ce-stable

2. Install the chosen version

For example: docker-ce-18.03.0.ce

$ sudo yum install docker-ce-<VERSION STRING>

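1.3. Upgrading Docker

Pick the target version and install it as described in 1.2.

1.4. Uninstalling Docker

# Uninstall Docker
$ sudo yum remove docker-ce

# Remove images, containers, volumes and other data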
$ sudo rm -rf /var/lib/docker

2. Installing Docker on Ubuntu

2.1. Installing Docker

2.1.1. Removing old versions

Older versions of Docker were named docker or docker-engine. If one is installed, remove it first.

sudo apt-get remove docker docker-engine docker.io

2.1.2. Installing from the repository

1. Update the apt package index

sudo apt-get update

2. Allow apt to use repositories over HTTPS

sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common

3. Add Docker's official GPG key

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -

4. Add the Docker package repository

sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"

2.1.3. Installing Docker

# update
sudo apt-get update

# install docker
sudo apt-get install docker-ce

2.1.4. Starting Docker

# Start Docker at boot
sudo systemctl enable docker
# Start Docker
sudo systemctl start docker

2.2. Installing a specific Docker version

1. List the versions available in the repository with apt-cache madison docker-ce

# apt-cache madison docker-ce
 docker-ce | 18.06.0~ce~3-0~ubuntu | https://download.docker.com/linux/ubuntu bionic/stable amd64 Packages
 docker-ce | 18.03.1~ce~3-0~ubuntu | https://download.docker.com/linux/ubuntu bionic/stable amd64 Packages

2. Install the chosen version

For example: docker-ce=18.03.0~ce-0~ubuntu

sudo apt-get install docker-ce=<VERSION>

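2.3. Upgrading Docker

# Update the package index
sudo apt-get update
# Then install the target version as described above

2.4. Uninstalling Docker

# Uninstall Docker CE
sudo apt-get purge docker-ce

# Remove images, containers, volumes and other data
sudo rm -rf /var/lib/docker

3. Installing Docker offline from rpm packages

3.1. Downloading the Docker rpm packages

rpm package location: https://mirrors.aliyun.com/docker-ce/linux/centos/7/x86_64/stable/Packages/

Download the required versions of containerd.io, docker-ce and docker-ce-cli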

wget https://mirrors.aliyun.com/docker-ce/linux/centos/7/x86_64/stable/Packages/containerd.io-1.2.6-3.3.el7.x86_64.rpm
wget https://mirrors.aliyun.com/docker-ce/linux/centos/7/x86_64/stable/Packages/docker-ce-18.09.9-3.el7.x86_64.rpm
wget https://mirrors.aliyun.com/docker-ce/linux/centos/7/x86_64/stable/Packages/docker-ce-cli-18.09.9-3.el7.x86_64.rpm

Download container-selinux

Location: http://mirror.centos.org/centos/7/extras/x86_64/Packages/

wget http://mirror.centos.org/centos/7/extras/x86_64/Packages/container-selinux-2.107-3.el7.noarch.rpm

3.2. Installing the rpm packages

# container-selinux
rpm -ivh container-selinux*.rpm
# containerd.io
rpm -ivh containerd.io*.rpm
# docker-ce and docker-ce-cli (the docker-ce*.rpm glob matches both packages,
# so they are installed together in one transaction)
rpm -ivh docker-ce*.rpm

3.3. Starting the Docker service

# Start
systemctl start docker
# Check the status
systemctl status docker


12.4 - Kata Container

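12.4.1 - Introduction to Kata Containers

Introduction to Kata-container

kata-container builds a secure container runtime on lightweight virtual machines. It behaves like a container, but uses hardware virtualization to provide strong isolation as a second layer of defense.

Features:

  • Security: a dedicated kernel, with isolation of network, I/O and memory.
  • Compatibility: supports the OCI container standard and the k8s CRI interface.
  • Performance: combines the security of virtual machines with the light weight of containers.
  • Simplicity: uses standard interfaces.

1. kata-container architecture

Figure: kata-container compared with traditional containers

2. kata-runtime

Kata Containers runtime (kata-runtime) uses QEMU/KVM to create lightweight virtual machines. It is compatible with the OCI runtime specification and supports the Kubernetes Container Runtime Interface (CRI), so it can replace the CRI shim runtime (runc) when k8s creates pods or containers.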

3. shim

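The shim is similar to Docker's containerd-shim or CRI-O's conmon: it monitors and reaps container processes. kata-shim has to handle all container I/O streams (stdout, stdin and stderr) and forward the relevant signals.

containerd-shim-kata-v2 implements the Containerd Runtime V2 (Shim API). With it, k8s can create a pod through a single containerd-shim-kata-v2 instead of 2N+1 shims [each consisting of a containerd-shim and a kata-shim].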

4. kata-agent

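Inside the virtual machine, kata-agent runs as a daemon process and launches the container processes. kata-agent runs a gRPC server in the guest over a VIRTIO or VSOCK interface (which QEMU exposes on the host as a socket file). kata-runtime talks to kata-agent over gRPC to send container management commands. The same protocol also carries the I/O streams (stdout, stderr, stdin) between the containers and the managing engine (for example Docker Engine).

All commands executed in a container and their I/O streams pass through the virtio-serial or vsock interface that QEMU exposes on the host. When VIRTIO is used, a Kata Containers proxy (kata-proxy) is created for each virtual machine to handle the commands and I/O streams.

kata-agent uses libcontainer to manage the container life cycle and reuses part of the runc code.

5. kata-proxy

kata-proxy gives kata-shim and kata-runtime a way to communicate with the kata-agent in the VM, over either virtio-serial or vsock; virtio-serial is the default.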

6. Hypervisor

kata-container uses QEMU/KVM to create the virtual machines in which containers run, and can support several hypervisors.

7. QEMU/KVM

To be added.


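12.4.2 - Kata configuration

1. Configuration file paths

The default configuration file is /usr/share/defaults/kata-containers/configuration.toml. If /etc/kata-containers/configuration.toml exists, it replaces the default file.

The following command shows the configuration file paths: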

# kata-runtime --kata-show-default-config-paths
/etc/kata-containers/configuration.toml
/usr/share/defaults/kata-containers/configuration.toml

Run with a custom configuration file:

kata-runtime --kata-config=/some/where/configuration.toml ...

2. kata-env

Show the environment settings used by the runtime:

kata-runtime kata-env

The output looks like this:

[Meta]
  Version = "1.0.23"

[Runtime]
  Debug = false
  Trace = false
  DisableGuestSeccomp = true
  DisableNewNetNs = false
  Path = "/usr/bin/kata-runtime"
  [Runtime.Version]
    Semver = "1.7.2"
    Commit = "9b9282693cfbcf70d442916bea56771cc6fc3afe"
    OCI = "1.0.1-dev"
  [Runtime.Config]
    Path = "/usr/share/defaults/kata-containers/configuration.toml"

[Hypervisor]
  MachineType = "pc"
  Version = "QEMU emulator version 2.11.0\nCopyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers"
  Path = "/usr/bin/qemu-lite-system-x86_64"
  BlockDeviceDriver = "virtio-scsi"
  EntropySource = "/dev/urandom"
  Msize9p = 8192
  MemorySlots = 10
  Debug = false
  UseVSock = false
  SharedFS = "virtio-9p"

[Image]
  Path = "/usr/share/kata-containers/kata-containers-image_centos_1.7.2_agent_20190702.img"

[Kernel]
  Path = "/usr/share/kata-containers/vmlinuz-4.19.28.42-6.1.container"
  Parameters = "init=/usr/lib/systemd/systemd systemd.unit=kata-containers.target systemd.mask=systemd-networkd.service systemd.mask=systemd-networkd.socket systemd.mask=systemd-journald.service systemd.mask=systemd-journald.socket systemd.mask=systemd-journal-flush.service systemd.mask=systemd-journald-dev-log.socket systemd.mask=systemd-udevd.service systemd.mask=systemd-udevd.socket systemd.mask=systemd-udev-trigger.service systemd.mask=systemd-udevd-kernel.socket systemd.mask=systemd-udevd-control.socket systemd.mask=systemd-timesyncd.service systemd.mask=systemd-update-utmp.service systemd.mask=systemd-tmpfiles-setup.service systemd.mask=systemd-tmpfiles-cleanup.service systemd.mask=systemd-tmpfiles-cleanup.timer systemd.mask=tmp.mount systemd.mask=systemd-random-seed.service systemd.mask=systemd-coredump@.service"

[Initrd]
  Path = ""

[Proxy]
  Type = "kataProxy"
  Version = "kata-proxy version 1.7.2-a56df7c"
  Path = "/usr/libexec/kata-containers/kata-proxy"
  Debug = false

[Shim]
  Type = "kataShim"
  Version = "kata-shim version 1.7.2-2ea178c"
  Path = "/usr/libexec/kata-containers/kata-shim"
  Debug = false

[Agent]
  Type = "kata"
  Debug = false
  Trace = false
  TraceMode = ""
  TraceType = ""

[Host]
  Kernel = "4.14.105-1-tlinux3-0008"
  Architecture = "amd64"
  VMContainerCapable = true
  SupportVSocks = true
  [Host.Distro]
    Name = "Tencent tlinux"
    Version = "2.2"
  [Host.CPU]
    Vendor = "GenuineIntel"
    Model = "Intel(R) Xeon(R) CPU           X3440  @ 2.53GHz"

[Netmon]
  Version = "kata-netmon version 1.7.2"
  Path = "/usr/libexec/kata-containers/kata-netmon"
  Debug = false
  Enable = false

3. configuration.toml

# Copyright (c) 2017-2019 Intel Corporation
#
# SPDX-License-Identifier: Apache-2.0
#

# XXX: WARNING: this file is auto-generated.
# XXX:
# XXX: Source file: "cli/config/configuration-qemu.toml.in"
# XXX: Project:
# XXX:   Name: Kata Containers
# XXX:   Type: kata

[hypervisor.qemu]
path = "/usr/bin/qemu-lite-system-x86_64"
kernel = "/usr/share/kata-containers/vmlinuz.container"
image = "/usr/share/kata-containers/kata-containers.img"
machine_type = "pc"

# Optional space-separated list of options to pass to the guest kernel.
# For example, use `kernel_params = "vsyscall=emulate"` if you are having
# trouble running pre-2.15 glibc.
#
# WARNING: - any parameter specified here will take priority over the default
# parameter value of the same name used to start the virtual machine.
# Do not set values here unless you understand the impact of doing so as you
# may stop the virtual machine from booting.
# To see the list of default parameters, enable hypervisor debug, create a
# container and look for 'default-kernel-parameters' log entries.
kernel_params = ""

# Path to the firmware.
# If you want that qemu uses the default firmware leave this option empty
firmware = ""

# Machine accelerators
# comma-separated list of machine accelerators to pass to the hypervisor.
# For example, `machine_accelerators = "nosmm,nosmbus,nosata,nopit,static-prt,nofw"`
machine_accelerators=""

# Default number of vCPUs per SB/VM:
# unspecified or 0                --> will be set to 1
# < 0                             --> will be set to the actual number of physical cores
# > 0 <= number of physical cores --> will be set to the specified number
# > number of physical cores      --> will be set to the actual number of physical cores
default_vcpus = 1

# Default maximum number of vCPUs per SB/VM:
# unspecified or == 0             --> will be set to the actual number of physical cores or to the maximum number
#                                     of vCPUs supported by KVM if that number is exceeded
# > 0 <= number of physical cores --> will be set to the specified number
# > number of physical cores      --> will be set to the actual number of physical cores or to the maximum number
#                                     of vCPUs supported by KVM if that number is exceeded
# WARNING: Depending of the architecture, the maximum number of vCPUs supported by KVM is used when
# the actual number of physical cores is greater than it.
# WARNING: Be aware that this value impacts the virtual machine's memory footprint and CPU
# the hotplug functionality. For example, `default_maxvcpus = 240` specifies that until 240 vCPUs
# can be added to a SB/VM, but the memory footprint will be big. Another example, with
# `default_maxvcpus = 8` the memory footprint will be small, but 8 will be the maximum number of
# vCPUs supported by the SB/VM. In general, we recommend that you do not edit this variable,
# unless you know what are you doing.
default_maxvcpus = 0

# Bridges can be used to hot plug devices.
# Limitations:
# * Currently only pci bridges are supported
# * Until 30 devices per bridge can be hot plugged.
# * Until 5 PCI bridges can be cold plugged per VM.
#   This limitation could be a bug in qemu or in the kernel
# Default number of bridges per SB/VM:
# unspecified or 0   --> will be set to 1
# > 1 <= 5           --> will be set to the specified number
# > 5                --> will be set to 5
default_bridges = 1

# Default memory size in MiB for SB/VM.
# If unspecified then it will be set 2048 MiB.
default_memory = 2048
#
# Default memory slots per SB/VM.
# If unspecified then it will be set 10.
# This is will determine the times that memory will be hotadded to sandbox/VM.
#memory_slots = 10

# The size in MiB will be plused to max memory of hypervisor.
# It is the memory address space for the NVDIMM devie.
# If set block storage driver (block_device_driver) to "nvdimm",
# should set memory_offset to the size of block device.
# Default 0
#memory_offset = 0

# Disable block device from being used for a container's rootfs.
# In case of a storage driver like devicemapper where a container's
# root file system is backed by a block device, the block device is passed
# directly to the hypervisor for performance reasons.
# This flag prevents the block device from being passed to the hypervisor,
# 9pfs is used instead to pass the rootfs.
disable_block_device_use = false

# Shared file system type:
#   - virtio-9p (default)
#   - virtio-fs
shared_fs = "virtio-9p"

# Path to vhost-user-fs daemon.
virtio_fs_daemon = "/usr/bin/virtiofsd-x86_64"

# Default size of DAX cache in MiB
virtio_fs_cache_size = 1024

# Cache mode:
#
#  - none
#    Metadata, data, and pathname lookup are not cached in guest. They are
#    always fetched from host and any changes are immediately pushed to host.
#
#  - auto
#    Metadata and pathname lookup cache expires after a configured amount of
#    time (default is 1 second). Data is cached while the file is open (close
#    to open consistency).
#
#  - always
#    Metadata, data, and pathname lookup are cached in guest and never expire.
virtio_fs_cache = "always"

# Block storage driver to be used for the hypervisor in case the container
# rootfs is backed by a block device. This is virtio-scsi, virtio-blk
# or nvdimm.
block_device_driver = "virtio-scsi"

# Specifies cache-related options will be set to block devices or not.
# Default false
#block_device_cache_set = true

# Specifies cache-related options for block devices.
# Denotes whether use of O_DIRECT (bypass the host page cache) is enabled.
# Default false
#block_device_cache_direct = true

# Specifies cache-related options for block devices.
# Denotes whether flush requests for the device are ignored.
# Default false
#block_device_cache_noflush = true

# Enable iothreads (data-plane) to be used. This causes IO to be
# handled in a separate IO thread. This is currently only implemented
# for SCSI.
#
enable_iothreads = false

# Enable pre allocation of VM RAM, default false
# Enabling this will result in lower container density
# as all of the memory will be allocated and locked
# This is useful when you want to reserve all the memory
# upfront or in the cases where you want memory latencies
# to be very predictable
# Default false
#enable_mem_prealloc = true

# Enable huge pages for VM RAM, default false
# Enabling this will result in the VM memory
# being allocated using huge pages.
# This is useful when you want to use vhost-user network
# stacks within the container. This will automatically
# result in memory pre allocation
#enable_hugepages = true

# Enable swap of vm memory. Default false.
# The behaviour is undefined if mem_prealloc is also set to true
#enable_swap = true

# This option changes the default hypervisor and kernel parameters
# to enable debug output where available. This extra output is added
# to the proxy logs, but only when proxy debug is also enabled.
#
# Default false
#enable_debug = true

# Disable the customizations done in the runtime when it detects
# that it is running on top a VMM. This will result in the runtime
# behaving as it would when running on bare metal.
#
#disable_nesting_checks = true

# This is the msize used for 9p shares. It is the number of bytes
# used for 9p packet payload.
#msize_9p = 8192

# If true and vsocks are supported, use vsocks to communicate directly
# with the agent and no proxy is started, otherwise use unix
# sockets and start a proxy to communicate with the agent.
# Default false
#use_vsock = true

# VFIO devices are hotplugged on a bridge by default.
# Enable hotplugging on root bus. This may be required for devices with
# a large PCI bar, as this is a current limitation with hotplugging on
# a bridge. This value is valid for "pc" machine type.
# Default false
#hotplug_vfio_on_root_bus = true

# If host doesn't support vhost_net, set to true. Thus we won't create vhost fds for nics.
# Default false
#disable_vhost_net = true
#
# Default entropy source.
# The path to a host source of entropy (including a real hardware RNG)
# /dev/urandom and /dev/random are two main options.
# Be aware that /dev/random is a blocking source of entropy.  If the host
# runs out of entropy, the VMs boot time will increase leading to get startup
# timeouts.
# The source of entropy /dev/urandom is non-blocking and provides a
# generally acceptable source of entropy. It should work well for pretty much
# all practical purposes.
#entropy_source= "/dev/urandom"

# Path to OCI hook binaries in the *guest rootfs*.
# This does not affect host-side hooks which must instead be added to
# the OCI spec passed to the runtime.
#
# You can create a rootfs with hooks by customizing the osbuilder scripts:
# https://github.com/kata-containers/osbuilder
#
# Hooks must be stored in a subdirectory of guest_hook_path according to their
# hook type, i.e. "guest_hook_path/{prestart,poststart,poststop}".
# The agent will scan these directories for executable files and add them, in
# lexicographical order, to the lifecycle of the guest container.
# Hooks are executed in the runtime namespace of the guest. See the official documentation:
# https://github.com/opencontainers/runtime-spec/blob/v1.0.1/config.md#posix-platform-hooks
# Warnings will be logged if any error is encountered while scanning for hooks,
# but it will not abort container execution.
#guest_hook_path = "/usr/share/oci/hooks"

[factory]
# VM templating support. Once enabled, new VMs are created from template
# using vm cloning. They will share the same initial kernel, initramfs and
# agent memory by mapping it readonly. It helps speeding up new container
# creation and saves a lot of memory if there are many kata containers running
# on the same host.
#
# When disabled, new VMs are created from scratch.
#
# Note: Requires "initrd=" to be set ("image=" is not supported).
#
# Default false
#enable_template = true

# Specifies the path of template.
#
# Default "/run/vc/vm/template"
#template_path = "/run/vc/vm/template"

# The number of caches of VMCache:
# unspecified or == 0   --> VMCache is disabled
# > 0                   --> will be set to the specified number
#
# VMCache is a function that creates VMs as caches before using it.
# It helps speed up new container creation.
# The function consists of a server and some clients communicating
# through Unix socket.  The protocol is gRPC in protocols/cache/cache.proto.
# The VMCache server will create some VMs and cache them by factory cache.
# It will convert the VM to gRPC format and transport it when it gets
# a request from clients.
# Factory grpccache is the VMCache client.  It will request gRPC format
# VM and convert it back to a VM.  If VMCache function is enabled,
# kata-runtime will request VM from factory grpccache when it creates
# a new sandbox.
#
# Default 0
#vm_cache_number = 0

# Specify the address of the Unix socket that is used by VMCache.
#
# Default /var/run/kata-containers/cache.sock
#vm_cache_endpoint = "/var/run/kata-containers/cache.sock"

[proxy.kata]
path = "/usr/libexec/kata-containers/kata-proxy"

# If enabled, proxy messages will be sent to the system log
# (default: disabled)
#enable_debug = true

[shim.kata]
path = "/usr/libexec/kata-containers/kata-shim"

# If enabled, shim messages will be sent to the system log
# (default: disabled)
#enable_debug = true

# If enabled, the shim will create opentracing.io traces and spans.
# (See https://www.jaegertracing.io/docs/getting-started).
#
# Note: By default, the shim runs in a separate network namespace. Therefore,
# to allow it to send trace details to the Jaeger agent running on the host,
# it is necessary to set 'disable_new_netns=true' so that it runs in the host
# network namespace.
#
# (default: disabled)
#enable_tracing = true

[agent.kata]
# If enabled, make the agent display debug-level messages.
# (default: disabled)
#enable_debug = true

# Enable agent tracing.
#
# If enabled, the default trace mode is "dynamic" and the
# default trace type is "isolated". The trace mode and type are set
# explicitly with the `trace_type=` and `trace_mode=` options.
#
# Notes:
#
# - Tracing is ONLY enabled when `enable_tracing` is set: explicitly
#   setting `trace_mode=` and/or `trace_type=` without setting `enable_tracing`
#   will NOT activate agent tracing.
#
# - See https://github.com/kata-containers/agent/blob/master/TRACING.md for
#   full details.
#
# (default: disabled)
#enable_tracing = true
#
#trace_mode = "dynamic"
#trace_type = "isolated"

[netmon]
# If enabled, the network monitoring process gets started when the
# sandbox is created. This allows for the detection of some additional
# network being added to the existing network namespace, after the
# sandbox has been created.
# (default: disabled)
#enable_netmon = true

# Specify the path to the netmon binary.
path = "/usr/libexec/kata-containers/kata-netmon"

# If enabled, netmon messages will be sent to the system log
# (default: disabled)
#enable_debug = true

[runtime]
# If enabled, the runtime will log additional debug messages to the
# system log
# (default: disabled)
#enable_debug = true
#
# Internetworking model
# Determines how the VM should be connected to
# the container network interface
# Options:
#
#   - bridged
#     Uses a linux bridge to interconnect the container interface to
#     the VM. Works for most cases except macvlan and ipvlan.
#
#   - macvtap
#     Used when the Container network interface can be bridged using
#     macvtap.
#
#   - none
#     Used for a customized network. Only creates a tap device. No veth pair.
#
#   - tcfilter
#     Uses tc filter rules to redirect traffic from the network interface
#     provided by plugin to a tap interface connected to the VM.
#
internetworking_model="tcfilter"

# disable guest seccomp
# Determines whether container seccomp profiles are passed to the virtual
# machine and applied by the kata agent. If set to true, seccomp is not applied
# within the guest
# (default: true)
disable_guest_seccomp=true

# If enabled, the runtime will create opentracing.io traces and spans.
# (See https://www.jaegertracing.io/docs/getting-started).
# (default: disabled)
#enable_tracing = true

# If enabled, the runtime will not create a network namespace for shim and hypervisor processes.
# This option may have some potential impacts to your host. It should only be used when you know what you're doing.
# `disable_new_netns` conflicts with `enable_netmon`
# `disable_new_netns` conflicts with `internetworking_model=bridged` and `internetworking_model=macvtap`. It works only
# with `internetworking_model=none`. The tap device will be in the host network namespace and can connect to a bridge
# (like OVS) directly.
# If you are using docker, `disable_new_netns` only works with `docker run --net=none`
# (default: false)
#disable_new_netns = true

# Enabled experimental feature list, format: ["a", "b"].
# Experimental features are features not stable enough for production,
# They may break compatibility, and are prepared for a big version bump.
# Supported experimental features:
# 1. "newstore": new persist storage driver which breaks backward compatibility,
#				expected to move out of experimental in 2.0.0.
# (default: [])
experimental=[]


12.5 - GPU

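12.5.1 - Introduction to nvidia-device-plugin

1. Overview

The NVIDIA device plugin is deployed as a Kubernetes DaemonSet on every node in the cluster and implements the Kubernetes device plugin interface.

It provides the following features:

  • Exposes the number of GPUs on each node to the cluster
  • Tracks the health of the GPUs
  • Makes it possible to run GPU containers on Kubernetes nodes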

2. Requirements

  • NVIDIA drivers ~= 384.81
  • nvidia-docker version > 2.0 (see how to install and its prerequisites)
  • docker configured with nvidia as the default runtime.
  • Kubernetes version >= 1.10

3. Usage

3.1. Install the NVIDIA drivers and nvidia-docker

Prepare each machine that will serve as a GPU node as follows:

  1. Install NVIDIA drivers ~= 384.81
  2. Install nvidia-docker version > 2.0

3.2. Configure the docker runtime

Set the nvidia runtime as the default runtime on GPU nodes.

Edit /etc/docker/daemon.json and add the following runtime configuration.

{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
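A node's daemon.json often already holds other settings, so the runtime entry is usually merged in, not pasted over the file. The following is a minimal sketch of that merge. The helper name with_nvidia_runtime is hypothetical and is not part of docker or nvidia-docker; the keys and the runtime path are the ones shown in the snippet above.

```python
import json


def with_nvidia_runtime(existing_json: str) -> str:
    """Return daemon.json text with the nvidia runtime set as the default.

    Settings already present in the file are preserved.
    """
    # An empty or missing file is treated as an empty configuration.
    config = json.loads(existing_json) if existing_json.strip() else {}
    config["default-runtime"] = "nvidia"
    # Keep any runtimes that were already registered.
    runtimes = config.setdefault("runtimes", {})
    runtimes["nvidia"] = {
        "path": "/usr/bin/nvidia-container-runtime",
        "runtimeArgs": [],
    }
    return json.dumps(config, indent=4)


if __name__ == "__main__":
    # Hypothetical existing content, used only for illustration.
    print(with_nvidia_runtime('{"log-driver": "json-file"}'))
```

The returned text would be written back to /etc/docker/daemon.json, after which the docker daemon has to be restarted for the new default runtime to apply.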

3.3. Deploy nvidia-device-plugin

$ kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml

The DaemonSet YAML file for nvidia-device-plugin is as follows:

# Copyright (c) 2019, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-device-plugin-daemonset
  namespace: kube-system
spec:
  selector:
    matchLabels:
      name: nvidia-device-plugin-ds
  updateStrategy:
    type: RollingUpdate
  template:
    metadata:
      # This annotation is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      annotations:
        scheduler.alpha.kubernetes.io/critical-pod: ""
      labels:
        name: nvidia-device-plugin-ds
    spec:
      tolerations:
      # This toleration is deprecated. Kept here for backward compatibility
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      - key: CriticalAddonsOnly
        operator: Exists
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      # Mark this pod as a critical add-on; when enabled, the critical add-on
      # scheduler reserves resources for critical add-on pods so that they can
      # be rescheduled after a failure.
      # See https://kubernetes.io/docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/
      priorityClassName: "system-node-critical"
      containers:
      - image: nvidia/k8s-device-plugin:1.0.0-beta4
        name: nvidia-device-plugin-ctr
        securityContext:
          allowPrivilegeEscalation: false
          capabilities:
            drop: ["ALL"]
        volumeMounts:
          - name: device-plugin
            mountPath: /var/lib/kubelet/device-plugins
      volumes:
        - name: device-plugin
          hostPath:
            path: /var/lib/kubelet/device-plugins

3.4. Run GPU jobs

Create a GPU pod whose resource limits request the nvidia.com/gpu resource type:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
    - name: digits-container
      image: nvidia/digits:6.0
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs

4. Build and run nvidia-device-plugin

4.1. With docker

4.1.1. Build

  • Pull the image directly from Docker Hub
$ docker pull nvidia/k8s-device-plugin:1.0.0-beta4
  • Build the image from the remote repository
$ docker build -t nvidia/k8s-device-plugin:1.0.0-beta4 https://github.com/NVIDIA/k8s-device-plugin.git#1.0.0-beta4
  • Build the image after modifying nvidia-device-plugin
$ git clone https://github.com/NVIDIA/k8s-device-plugin.git && cd k8s-device-plugin
$ git checkout 1.0.0-beta4
$ docker build -t nvidia/k8s-device-plugin:1.0.0-beta4 .

4.1.2. Run

  • Run locally with docker
$ docker run --security-opt=no-new-privileges --cap-drop=ALL --network=none -it -v /var/lib/kubelet/device-plugins:/var/lib/kubelet/device-plugins nvidia/k8s-device-plugin:1.0.0-beta4
  • Run as a DaemonSet
$ kubectl create -f nvidia-device-plugin.yml

4.2. Without docker

4.2.1. Build

$ C_INCLUDE_PATH=/usr/local/cuda/include LIBRARY_PATH=/usr/local/cuda/lib64 go build

4.2.2. Run locally

$ ./k8s-device-plugin


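13 - Etcd Cluster

13.1 - Introduction to Etcd

1. What is Etcd (what)

etcd is a distributed, consistent key-value store for shared configuration and service discovery, with a focus on being:

  • Secure: automatic TLS with optional client cert authentication [optional SSL client certificate authentication; HTTPS access is supported]
  • Fast: benchmarked 10,000 writes/sec [benchmarked at 10,000 writes per second]
  • Reliable: properly distributed using Raft [Raft is used to guarantee consistency]

etcd is a distributed, consistent key-value store used mainly for shared configuration and service discovery. [The description above is taken from the etcd website.]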

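2. Why use Etcd (why)

2.1. Advantages of Etcd

  1. Simple. It is written in Go, so deployment is simple; it exposes an HTTP interface, so it is simple to use; it uses the Raft algorithm to guarantee strong consistency, which is easy for users to understand.
  2. Durable. By default etcd persists data as soon as it is updated.
  3. Secure. etcd supports SSL client certificate authentication.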

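3. How the Etcd architecture is built (how)

3.1. Etcd terminology

  • Raft: the algorithm etcd uses to guarantee strong consistency in a distributed system.
  • Node: an instance of the Raft state machine.
  • Member: an etcd instance. It manages one Node and can serve client requests.
  • Cluster: an etcd cluster made up of multiple Members that work together.
  • Peer: the name for another Member in the same etcd cluster.
  • Client: a client that sends HTTP requests to the etcd cluster.
  • WAL: write-ahead log, the log format etcd uses for persistent storage.
  • snapshot: a snapshot that etcd takes to keep the number of WAL files from growing too large; it stores the state of the etcd data.
  • Proxy: an etcd mode that provides a reverse proxy for the etcd cluster.
  • Leader: the node elected in the Raft algorithm to handle all data commits.
  • Follower: a node that lost the election; it acts as a subordinate node in Raft and helps the algorithm provide its strong-consistency guarantee.
  • Candidate: when a Follower receives no heartbeat from the Leader for a certain time, it becomes a Candidate and starts an election.
  • Term: the period from a node becoming Leader until the next election is called a Term.
  • Index: the sequence number of a data entry. Raft locates data by Term and Index.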

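3.2. Etcd architecture diagram

etcd architecture diagram

A user request is forwarded by the HTTP Server to the Store for transaction processing. If the request modifies node data, it is handed to the Raft module, which changes state and records the log entry, then synchronizes the entry to the other etcd nodes to confirm the commit. Finally the data is committed and synchronized once more.

1. HTTP Server:

Handles API requests sent by users as well as synchronization and heartbeat requests from other etcd nodes.

2. Raft:

The concrete implementation of the Raft strong-consistency algorithm; this is the core of etcd.

3. WAL:

The Write Ahead Log is how etcd stores data; write-ahead logging is a family of techniques for providing atomicity and durability. Apart from keeping the state of all data and the node index in memory, etcd persists data through the WAL. In the WAL, every piece of data is logged before it is committed.

  1. Entry [log content]:

    Stores the concrete log content.

  2. Snapshot [snapshot content]:

    A Snapshot is a state snapshot taken to keep the amount of data from growing too large; it saves the Raft state when the log content changes.

4. Store:

Handles the transactions behind the features etcd supports, including data indexing, node state changes, monitoring and feedback, and event handling and execution. It is the concrete implementation of most of the API functionality that etcd offers to users.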
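The write-ahead idea described under WAL (log every change before applying it, then rebuild state by replaying the log) can be illustrated with a small in-memory sketch. This is not etcd's WAL format or code; the class TinyStore and its JSON record layout are hypothetical and exist only to show the ordering of "log first, apply second".

```python
import json


class TinyStore:
    """Toy key-value store that logs each change before applying it."""

    def __init__(self, wal_lines=None):
        self.wal = []    # the append-only log (a list stands in for a file)
        self.state = {}  # in-memory state rebuilt from the log
        # Recovery: replay every logged record in order.
        for line in wal_lines or []:
            self.wal.append(line)
            self._apply(json.loads(line))

    def put(self, key, value):
        record = {"op": "put", "key": key, "value": value}
        # Write-ahead: the record reaches the log before the state changes.
        self.wal.append(json.dumps(record))
        self._apply(record)

    def _apply(self, record):
        if record["op"] == "put":
            self.state[record["key"]] = record["value"]


if __name__ == "__main__":
    store = TinyStore()
    store.put("foo", "bar")
    # Simulate a restart: a new store rebuilt purely from the log.
    recovered = TinyStore(store.wal)
    print(recovered.state)
```

A snapshot, in this picture, would be a saved copy of state that lets the log entries before it be discarded.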

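13.2 - The Raft Algorithm

1. The Raft protocol [a distributed consensus algorithm]

raft

The Raft algorithm involves three roles:

  • follower: a node that follows the leader
  • candidate: a transitional role that exists only during an election
  • leader: the node that leads the cluster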

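2. Process

2.1. Election

Two timeouts control the election. The first is the election timeout, the time a node waits as a follower before becoming a candidate; it is a random value between 150 and 300 milliseconds. The other is the heartbeat timeout.

  • When a node's election timeout expires, it becomes a candidate and starts a new election term. It sends Request Vote messages to the other nodes; a node that receives the message and has not yet voted in that term votes for this candidate and then resets its own election timeout.
  • Once the candidate has received votes from a majority of the nodes, it becomes the leader.
  • The leader then starts sending Append Entries messages to the followers. These messages are sent at the interval given by the heartbeat timeout, and each follower that receives one responds to the leader.
  • The election term continues until some follower stops receiving heartbeats and becomes a candidate.
  • If two candidates receive the same number of votes in an election term, the nodes wait for a new term and hold the election again.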
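The election rules above can be sketched as a small deterministic simulation. It is a teaching sketch, not etcd's Raft implementation: all function names are hypothetical, and distributing follower votes round-robin among simultaneous candidates is a simplifying assumption made here to reproduce a split vote.

```python
import random


def random_election_timeout(rng: random.Random) -> int:
    """Pick an election timeout in milliseconds from the 150-300 ms range."""
    return rng.randint(150, 300)


def has_majority(votes: int, cluster_size: int) -> bool:
    """A candidate wins only with strictly more than half of all votes."""
    return votes > cluster_size // 2


def run_election(timeouts: dict):
    """Simulate one election term and return the leader, or None on a split vote.

    `timeouts` maps node name -> election timeout in ms. Every node whose
    timeout is the smallest becomes a candidate at the same moment.
    """
    earliest = min(timeouts.values())
    candidates = sorted(n for n, t in timeouts.items() if t == earliest)
    # Each candidate votes for itself.
    votes = {c: 1 for c in candidates}
    followers = sorted(n for n in timeouts if n not in votes)
    # Assumption: each follower votes once, for whichever request reaches it
    # first; here that is modelled as round-robin over the candidates.
    for i, _follower in enumerate(followers):
        votes[candidates[i % len(candidates)]] += 1
    for candidate, count in votes.items():
        if has_majority(count, len(timeouts)):
            return candidate
    return None  # split vote: wait for a new term and retry


if __name__ == "__main__":
    rng = random.Random()
    nodes = {name: random_election_timeout(rng) for name in ("a", "b", "c")}
    print(nodes, "->", run_election(nodes))
```

Randomizing the timeout is what makes two nodes timing out at the same instant unlikely, which in turn keeps split votes rare.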

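2.2. Replication

Once the election has finished and a leader has been chosen, the leader must replicate every change to the other nodes in the system. This replication is also done by sending Append Entries messages.

  • First, a client sends an update to the leader, and the update is appended to the leader's log.
  • The leader then sends the update to the followers with its next heartbeat.
  • Once a majority of the nodes (counting the leader itself) hold the update and the followers have acknowledged it to the leader, the leader commits the update and then replies to the client.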
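The commit rule in the last step can be sketched as follows. This is a simplified model, not etcd code: the Leader class and its replicate method are hypothetical, the number of follower acknowledgements is passed in directly instead of arriving over the network, and the leader's own log copy is counted toward the majority.

```python
class Leader:
    """Toy leader that commits an update once a majority holds it."""

    def __init__(self, follower_count: int) -> None:
        self.cluster_size = follower_count + 1  # followers plus the leader
        self.log = []            # every proposed update, committed or not
        self.commit_index = -1   # index of the last committed log entry

    def replicate(self, update: str, acks_from_followers: int) -> bool:
        """Append an update and commit it if a majority acknowledged it."""
        self.log.append(update)
        index = len(self.log) - 1
        # The leader's own copy counts as one replica.
        copies = 1 + acks_from_followers
        if copies > self.cluster_size // 2:
            self.commit_index = index
            return True   # safe to reply to the client
        return False      # stays in the log, but uncommitted


if __name__ == "__main__":
    leader = Leader(follower_count=4)
    print(leader.replicate("x=1", acks_from_followers=2))
    print(leader.replicate("x=2", acks_from_followers=1))
```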

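2.3. Network partitions

  • When a network partition occurs, the nodes that are cut off from the leader no longer receive its heartbeats, so they start an election. The result is several partitions, each with its own leader.
  • When clients send updates to the different leaders, a partition that holds fewer than half of the original cluster's nodes cannot commit the update, while a partition that holds more than half of the nodes can.
  • When the partition heals, the committed updates are replicated to the other nodes. Uncommitted log entries on those nodes are rolled back and replaced to match the new leader's log, so the data is consistent across the whole cluster.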
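The two rules above (only the majority side can commit, and the minority side discards its uncommitted entries after healing) can be sketched as below. The function names are hypothetical, and log entries are modelled as (term, value) tuples purely for illustration.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that forms a majority."""
    return cluster_size // 2 + 1


def partition_can_commit(partition_size: int, cluster_size: int) -> bool:
    """A partition can commit only if it still contains a majority."""
    return partition_size >= quorum(cluster_size)


def reconcile(stale_log: list, leader_log: list):
    """Make a stale node's log match the leader's log after a partition heals.

    Returns (new_log, rolled_back): entries after the longest common prefix
    are rolled back and replaced with the leader's entries.
    """
    common = 0
    for mine, theirs in zip(stale_log, leader_log):
        if mine != theirs:
            break
        common += 1
    rolled_back = stale_log[common:]
    new_log = stale_log[:common] + leader_log[common:]
    return new_log, rolled_back


if __name__ == "__main__":
    # A 5-node cluster split into partitions of 2 and 3 nodes.
    print(partition_can_commit(2, 5), partition_can_commit(3, 5))
    minority = [(1, "a"), (1, "b-uncommitted")]
    majority = [(1, "a"), (2, "c"), (2, "d")]
    print(reconcile(minority, majority))
```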


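13.3 - The etcdctl command-line tool

13.3.1 - etcdctl-V3

The etcdctl v3 commands differ from the v2 commands. This article describes how to use the v3 version of the etcdctl command-line tool.

1. Installing etcdctl

The etcdctl binary can be downloaded from github.com/coreos/etcd/releases by choosing the appropriate version. For example, you can run the following install_etcdctl.sh script after adjusting the version in it.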

#!/bin/bash
ETCD_VER=v3.3.4
ETCD_DIR=etcd-download
DOWNLOAD_URL=https://github.com/coreos/etcd/releases/download

# Download
mkdir -p "${ETCD_DIR}"
cd "${ETCD_DIR}" || exit 1
wget ${DOWNLOAD_URL}/${ETCD_VER}/etcd-${ETCD_VER}-linux-amd64.tar.gz 
tar -xzvf etcd-${ETCD_VER}-linux-amd64.tar.gz

# install
cd etcd-${ETCD_VER}-linux-amd64
cp etcdctl /usr/local/bin/

2. etcdctl V3

When using etcdctl v3, set the environment variable ETCDCTL_API=3.

export ETCDCTL_API=3

# Or add the environment variable to `/etc/profile`
vi /etc/profile
...
export ETCDCTL_API=3
...
source /etc/profile

# Or prefix the command with ETCDCTL_API=3
ETCDCTL_API=3 etcdctl --endpoints=$ENDPOINTS member list

Check the current etcdctl version with etcdctl version:

[root@k8s-dbg-master-1 etcd]# etcdctl version
etcdctl version: 3.3.4
API version: 3.3

For more command help, run etcdctl --help:

[root@k8s-dbg-master-1 etcd]# etcdctl --help
NAME:
	etcdctl - A simple command line client for etcd3.

USAGE:
	etcdctl

VERSION:
	3.3.4

API VERSION:
	3.3


COMMANDS:
	get			Gets the key or a range of keys
	put			Puts the given key into the store
	del			Removes the specified key or range of keys [key, range_end)
	txn			Txn processes all the requests in one transaction
	compaction		Compacts the event history in etcd
	alarm disarm		Disarms all alarms
	alarm list		Lists all alarms
	defrag			Defragments the storage of the etcd members with given endpoints
	endpoint health		Checks the healthiness of endpoints specified in `--endpoints` flag
	endpoint status		Prints out the status of endpoints specified in `--endpoints` flag
	endpoint hashkv		Prints the KV history hash for each endpoint in --endpoints
	move-leader		Transfers leadership to another etcd cluster member.
	watch			Watches events stream on keys or prefixes
	version			Prints the version of etcdctl
	lease grant		Creates leases
	lease revoke		Revokes leases
	lease timetolive	Get lease information
	lease list		List all active leases
	lease keep-alive	Keeps leases alive (renew)
	member add		Adds a member into the cluster
	member remove		Removes a member from the cluster
	member update		Updates a member in the cluster
	member list		Lists all members in the cluster
	snapshot save		Stores an etcd node backend snapshot to a given file
	snapshot restore	Restores an etcd member snapshot to an etcd directory
	snapshot status		Gets backend snapshot status of a given file
	make-mirror		Makes a mirror at the destination etcd cluster
	migrate			Migrates keys in a v2 store to a mvcc store
	lock			Acquires a named lock
	elect			Observes and participates in leader election
	auth enable		Enables authentication
	auth disable		Disables authentication
	user add		Adds a new user
	user delete		Deletes a user
	user get		Gets detailed information of a user
	user list		Lists all users
	user passwd		Changes password of user
	user grant-role		Grants a role to a user
	user revoke-role	Revokes a role from a user
	role add		Adds a new role
	role delete		Deletes a role
	role get		Gets detailed information of a role
	role list		Lists all roles
	role grant-permission	Grants a key to a role
	role revoke-permission	Revokes a key from a role
	check perf		Check the performance of the etcd cluster
	help			Help about any command

OPTIONS:
      --cacert=""				verify certificates of TLS-enabled secure servers using this CA bundle
      --cert=""					identify secure client using this TLS certificate file
      --command-timeout=5s			timeout for short running command (excluding dial timeout)
      --debug[=false]				enable client-side debug logging
      --dial-timeout=2s				dial timeout for client connections
  -d, --discovery-srv=""			domain name to query for SRV records describing cluster endpoints
      --endpoints=[127.0.0.1:2379]		gRPC endpoints
      --hex[=false]				print byte strings as hex encoded strings
      --insecure-discovery[=true]		accept insecure SRV records describing cluster endpoints
      --insecure-skip-tls-verify[=false]	skip server certificate verification
      --insecure-transport[=true]		disable transport security for client connections
      --keepalive-time=2s			keepalive time for client connections
      --keepalive-timeout=6s			keepalive timeout for client connections
      --key=""					identify secure client using this TLS key file
      --user=""					username[:password] for authentication (prompt if password is not supplied)
  -w, --write-out="simple"			set the output format (fields, json, protobuf, simple, table)

3. Common etcdctl commands

3.1. Specifying the etcd cluster

HOST_1=10.240.0.17
HOST_2=10.240.0.18
HOST_3=10.240.0.19
ENDPOINTS=$HOST_1:2379,$HOST_2:2379,$HOST_3:2379

etcdctl --endpoints=$ENDPOINTS member list

If etcd requires certificates for access, add the certificate-related flags:

ETCDCTL_API=3 etcdctl --endpoints=$ENDPOINTS --cacert=<ca-file> --cert=<cert-file> --key=<key-file>  <command>

The flags are described below:

  --cacert=""				verify certificates of TLS-enabled secure servers using this CA bundle
  --cert=""					identify secure client using this TLS certificate file
  --key=""					identify secure client using this TLS key file
  --endpoints=[127.0.0.1:2379]		gRPC endpoints

You can define an alias:

# Alias that avoids typing the certificate flags every time
alias ectl='ETCDCTL_API=3 etcdctl --endpoints=$ENDPOINTS --cacert=<ca-file> --cert=<cert-file> --key=<key-file>'

# Run commands through the alias
ectl <command>

3.2. Create, read, update, delete

1. Create

etcdctl --endpoints=$ENDPOINTS put foo "Hello World!"

2. Read

etcdctl --endpoints=$ENDPOINTS get foo
etcdctl --endpoints=$ENDPOINTS --write-out="json" get foo

Look up keys that share a prefix:

etcdctl --endpoints=$ENDPOINTS put web1 value1
etcdctl --endpoints=$ENDPOINTS put web2 value2
etcdctl --endpoints=$ENDPOINTS put web3 value3

etcdctl --endpoints=$ENDPOINTS get web --prefix

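List all keys under the / prefix (to include keys that do not start with /, use an empty prefix: get "" --prefix --keys-only):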

etcdctl --endpoints=$ENDPOINTS get / --prefix --keys-only
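A prefix lookup can be pictured as a range query over [key, range_end), the same half-open range notation that the del help text above uses. The sketch below computes a range_end for a prefix by incrementing its last byte, which is the usual way such a bound is derived; treat the helper names as hypothetical, and the b"\x00" "no upper bound" marker as an assumption of this sketch.

```python
def prefix_range_end(prefix: bytes) -> bytes:
    """Return the exclusive upper bound of all keys starting with `prefix`."""
    end = bytearray(prefix)
    # Walk backwards to the last byte that can be incremented without overflow.
    for i in reversed(range(len(end))):
        if end[i] < 0xFF:
            end[i] += 1
            return bytes(end[: i + 1])
    # Empty prefix, or all bytes are 0xFF: there is no upper bound.
    return b"\x00"


def keys_with_prefix(keys, prefix: bytes):
    """Select keys in [prefix, range_end), compared byte-wise."""
    end = prefix_range_end(prefix)
    unbounded = end == b"\x00"
    return sorted(k for k in keys if k >= prefix and (unbounded or k < end))


if __name__ == "__main__":
    stored = [b"foo", b"web1", b"web2", b"web3", b"wec"]
    print(prefix_range_end(b"web"))
    print(keys_with_prefix(stored, b"web"))
```

With this view, get web --prefix is a range read from web up to, but not including, wec.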

3. Delete

etcdctl --endpoints=$ENDPOINTS put key myvalue
etcdctl --endpoints=$ENDPOINTS del key

etcdctl --endpoints=$ENDPOINTS put k1 value1
etcdctl --endpoints=$ENDPOINTS put k2 value2
etcdctl --endpoints=$ENDPOINTS del k --prefix

3.3. Cluster status

Cluster status is checked mainly with two commands: etcdctl endpoint status and etcdctl endpoint health.

etcdctl --write-out=table --endpoints=$ENDPOINTS endpoint status

+------------------+------------------+---------+---------+-----------+-----------+------------+
|     ENDPOINT     |        ID        | VERSION | DB SIZE | IS LEADER | RAFT TERM | RAFT INDEX |
+------------------+------------------+---------+---------+-----------+-----------+------------+
| 10.240.0.17:2379 | 4917a7ab173fabe7 | 3.0.0   | 45 kB   | true      |         4 |      16726 |
| 10.240.0.18:2379 | 59796ba9cd1bcd72 | 3.0.0   | 45 kB   | false     |         4 |      16726 |
| 10.240.0.19:2379 | 94df724b66343e6c | 3.0.0   | 45 kB   | false     |         4 |      16726 |
+------------------+------------------+---------+---------+-----------+-----------+------------+

etcdctl --endpoints=$ENDPOINTS endpoint health

10.240.0.17:2379 is healthy: successfully committed proposal: took = 3.345431ms
10.240.0.19:2379 is healthy: successfully committed proposal: took = 3.767967ms
10.240.0.18:2379 is healthy: successfully committed proposal: took = 4.025451ms

3.4. Cluster members

The commands related to cluster members are:

	member add		    Adds a member into the cluster
	member remove		Removes a member from the cluster
	member update		Updates a member in the cluster
	member list		    Lists all members in the cluster

For example, etcdctl member list lists the cluster members.

etcdctl --endpoints=http://172.16.5.4:12379 member list -w table

+-----------------+---------+-------+------------------------+-----------------------------------------------+
|       ID        | STATUS  | NAME  |       PEER ADDRS       |                 CLIENT ADDRS                  |
+-----------------+---------+-------+------------------------+-----------------------------------------------+
| c856d92a82ba66a | started | etcd0 | http://172.16.5.4:2380 | http://172.16.5.4:2379,http://172.16.5.4:4001 |
+-----------------+---------+-------+------------------------+-----------------------------------------------+

4. etcdctl get

Use etcdctl {command} --help to see the help for a specific command.

# etcdctl get --help
NAME:
	get - Gets the key or a range of keys

USAGE:
	etcdctl get [options] <key> [range_end]

OPTIONS:
      --consistency="l"			Linearizable(l) or Serializable(s)
      --from-key[=false]		Get keys that are greater than or equal to the given key using byte compare
      --keys-only[=false]		Get only the keys
      --limit=0				Maximum number of results
      --order=""			Order of results; ASCEND or DESCEND (ASCEND by default)
      --prefix[=false]			Get keys with matching prefix
      --print-value-only[=false]	Only write values when using the "simple" output format
      --rev=0				Specify the kv revision
      --sort-by=""			Sort target; CREATE, KEY, MODIFY, VALUE, or VERSION

GLOBAL OPTIONS:
      --cacert=""				verify certificates of TLS-enabled secure servers using this CA bundle
      --cert=""					identify secure client using this TLS certificate file
      --command-timeout=5s			timeout for short running command (excluding dial timeout)
      --debug[=false]				enable client-side debug logging
      --dial-timeout=2s				dial timeout for client connections
      --endpoints=[127.0.0.1:2379]		gRPC endpoints
      --hex[=false]				print byte strings as hex encoded strings
      --insecure-skip-tls-verify[=false]	skip server certificate verification
      --insecure-transport[=true]		disable transport security for client connections
      --key=""					identify secure client using this TLS key file
      --user=""					username[:password] for authentication (prompt if password is not supplied)
  -w, --write-out="simple"			set the output format (fields, json, protobuf, simple, table)

Reference:

https://coreos.com/etcd/docs/latest/demo.html

13.3.2 - etcdctl-V2

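1. Introduction to etcdctl

etcdctl is a command-line client that provides a set of concise commands. It can be thought of as a toolbox that makes it convenient to test services or to modify the database contents by hand. etcdctl works much like other xxxctl commands (for example kubectl and systemctl).

Usage: etcdctl [global options] command [command options] [args...]

2. Common Etcd commands

2.1. Database commands

etcd organizes keys in a hierarchical namespace (similar to directories in a file system). Database operations revolve around managing the full lifecycle of keys and directories through CRUD (Create, Read, Update, Delete, a REST-style set of operations).

The options for each command can be found with etcdctl command --help.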

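2.1.1. Operating on keys

  1. set [create, whether or not the key exists]: etcdctl set key value

  2. mk [create, the key must not exist]: etcdctl mk key value

  3. rm [delete]: etcdctl rm key

  4. update [update]: etcdctl update key value

  5. get [read]: etcdctl get key

2.1.2. Operating on directories

  1. setdir [create, whether or not the directory exists]: etcdctl setdir dir

  2. mkdir [create, the directory must not exist]: etcdctl mkdir dir

  3. rmdir [delete]: etcdctl rmdir dir

  4. updatedir [update]: etcdctl updatedir dir

  5. ls [read]: etcdctl ls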

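2.2. Non-database commands

  1. backup [back up the etcd data]

    etcdctl backup

  2. watch [watch a key for changes; as soon as the key is updated, print the latest value and exit]

    etcdctl watch key

  3. exec-watch [watch a key for changes; as soon as the key is updated, run the given command]

    etcdctl exec-watch key -- sh -c "ls"

  4. member [list, add, remove or update etcd instances in the etcd cluster with the list, add, remove and update subcommands]

    etcdctl member list; etcdctl member add instance; etcdctl member remove instance; etcdctl member update instance.

  5. etcdctl cluster-health [check the health of the cluster]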

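2.3. Common configuration parameters

These parameters are set in the configuration file, which defaults to /etc/etcd/etcd.conf.

Parameter | Description
-name | Node name
-data-dir | Directory for storing logs and snapshots, that is, the node's data directory; defaults to the current working directory
-addr | Advertised IP address and port; defaults to 127.0.0.1:2379
-bind-addr | Listen address for client connections; defaults to the value of -addr
-peers | Comma-separated list of cluster members, for example 127.0.0.1:2380,127.0.0.1:2381
-peer-addr | Advertised IP address for cluster communication; defaults to 127.0.0.1:2380
-peer-bind-addr | Listen address for cluster communication; defaults to the value of -peer-addr
-wal-dir | Directory for the node's WAL files; when set, WAL files are stored separately from the other data files
-listen-client-urls | URLs to listen on for client traffic
-listen-peer-urls | URLs to listen on for communication with other nodes
-initial-advertise-peer-urls | URLs advertised to the other nodes in the cluster
-advertise-client-urls | URLs advertised to clients, that is, the service URLs
-initial-cluster-token | ID of the cluster
-initial-cluster | All nodes in the cluster
-initial-cluster-state | -initial-cluster-state=new means the etcd cluster is being built from scratch
-discovery-srv | Used for DNS-based dynamic service discovery; specifies the DNS SRV domain name
-discovery | Used for etcd dynamic discovery; specifies the URL of the etcd discovery service [https://discovery.etcd.io/], given as an environment variable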

13.4 - Etcd access control

1. ETCD resource types

There are three types of resources in etcd:

  • permission resources: users and roles in the user store
  • key-value resources: key-value pairs in the key-value store
  • settings resources: security settings, auth settings, and dynamic etcd cluster settings (election/heartbeat)

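2. Permission resources

Users: a user is used for authentication (user:passwd). A user can hold multiple roles, and each role is assigned certain permissions (read-only, write-only, read-write). Users are divided into the root user and non-root users.

Roles: a role groups permissions. There are three main kinds of role. The root role is created by default when the root user is created and holds all permissions. The guest role is created automatically by default and is used for unauthenticated access. Ordinary roles are created by the root user and assigned specific permissions.

Note: if no authentication is specified, that is, no user is explicitly given for the request, the request is treated as the guest role. By default guest also has global access. If you do not want etcd data to be read or modified without authorization, revoke the guest role's permissions or remove the role, using etcdctl role revoke.

Permissions: there are three permissions, read-only, write-only and read-write. A permission grants read or write access to a specified directory or key.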

3. ETCD access control

3.1. Access control commands

NAME:
   etcdctl - A simple command line client for etcd.
USAGE:
   etcdctl [global options] command [command options] [arguments...]
VERSION:
   2.2.0
COMMANDS:
   user         user add, grant and revoke subcommands
   role         role add, grant and revoke subcommands
   auth         overall auth controls  
GLOBAL OPTIONS:
   --peers, -C          a comma-delimited list of machine addresses in the cluster (default: "http://127.0.0.1:4001,http://127.0.0.1:2379")
   --endpoint           a comma-delimited list of machine addresses in the cluster (default: "http://127.0.0.1:4001,http://127.0.0.1:2379")
   --cert-file          identify HTTPS client using this SSL certificate file
   --key-file           identify HTTPS client using this SSL key file
   --ca-file            verify certificates of HTTPS-enabled servers using this CA bundle
   --username, -u       provide username[:password] and prompt if password is not supplied.
   --timeout '1s'       connection timeout per request

3.2. user commands

[root@localhost etcd]# etcdctl user --help
NAME:
   etcdctl user - user add, grant and revoke subcommands
USAGE:
   etcdctl user command [command options] [arguments...]
COMMANDS:
   add      add a new user for the etcd cluster
   get      get details for a user
   list     list all current users
   remove   remove a user for the etcd cluster
   grant    grant roles to an etcd user
   revoke   revoke roles for an etcd user
   passwd   change password for a user
   help, h  Shows a list of commands or help for one command
    
OPTIONS:
   --help, -h   show help

3.2.1. Add the root user and set its password

etcdctl --endpoints http://172.16.22.36:2379 user add root

3.2.2. Add a non-root user and set its password

etcdctl --endpoints http://172.16.22.36:2379 --username root:123 user add huwh

3.2.3. List all current users

etcdctl --endpoints http://172.16.22.36:2379 --username root:123 user list

3.2.4. Assign a role to a user

etcdctl --endpoints http://172.16.22.36:2379 --username root:123 user grant --roles test1 phpor

3.2.5. See which roles a user has

etcdctl --endpoints http://172.16.22.36:2379 --username root:123 user get phpor

3.3. role commands

[root@localhost etcd]# etcdctl role --help
NAME:
   etcdctl role - role add, grant and revoke subcommands
USAGE:
   etcdctl role command [command options] [arguments...]
COMMANDS:
   add      add a new role for the etcd cluster
   get      get details for a role
   list     list all roles
   remove   remove a role from the etcd cluster
   grant    grant path matches to an etcd role
   revoke   revoke path matches for an etcd role
   help, h  Shows a list of commands or help for one command
    
OPTIONS:
   --help, -h   show help

3.3.1. Add a role

etcdctl --endpoints http://172.16.22.36:2379 --username root:123 role add test1

3.3.2. List all roles

etcdctl --endpoints http://172.16.22.36:2379 --username root:123 role list

3.3.3. Grant permissions to a role

[root@localhost etcd]# etcdctl role grant --help
NAME:
   grant - grant path matches to an etcd role
USAGE:
   command grant [command options] [arguments...]
OPTIONS:
   --path   Path granted for the role to access
   --read   Grant read-only access
   --write  Grant write-only access
   --readwrite  Grant read-write access

1. The directory only: etcdctl --endpoints http://172.16.22.36:2379 --username root:123 role grant --readwrite --path /test1 test1

2. The directory plus its subdirectories and files: etcdctl --endpoints http://172.16.22.36:2379 --username root:123 role grant --readwrite --path '/test1/*' test1

3.3.4. View the permissions a role holds

etcdctl --endpoints http://172.16.22.36:2379 --username root:123 role get test1

3.4. auth commands

[root@localhost etcd]# etcdctl auth --help
NAME:
   etcdctl auth - overall auth controls
USAGE:
   etcdctl auth command [command options] [arguments...]
COMMANDS:
   enable   enable auth access controls
   disable  disable auth access controls
   help, h  Shows a list of commands or help for one command
    
OPTIONS:
   --help, -h   show help

3.4.1. Enable authentication

etcdctl --endpoints http://172.16.22.36:2379 auth enable

4. Steps for setting up access control

Order | Step | Command
1 | Add the root user | etcdctl --endpoints http://: user add root
2 | Enable authentication | etcdctl --endpoints http://: auth enable
3 | Add a non-root user | etcdctl --endpoints http://: --username root: user add
4 | Add a role | etcdctl --endpoints http://: --username root: role add
5 | Grant permissions to the role (read-only, write-only, read-write) | etcdctl --endpoints http://: --username root: role grant --readwrite --path
6 | Assign the role to the user (which gives the user the role's permissions) | etcdctl --endpoints http://: --username root: user grant --roles
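The six steps can be written down as an ordered list of argument vectors, which makes the required order explicit and easy to script. This is only a sketch: the function access_control_commands and all of its parameters are hypothetical, and the subcommands and flags are copied from the etcdctl v2 examples earlier in this section, not verified against any particular etcdctl release.

```python
def access_control_commands(endpoint, root_password, user, role, path):
    """Build the etcdctl (v2 API) commands for the six setup steps, in order.

    Each command is a list of arguments, suitable for subprocess.run().
    """
    base = ["etcdctl", "--endpoints", endpoint]
    # Once auth is enabled, later commands authenticate as root.
    as_root = base + ["--username", f"root:{root_password}"]
    return [
        base + ["user", "add", "root"],                                    # 1
        base + ["auth", "enable"],                                         # 2
        as_root + ["user", "add", user],                                   # 3
        as_root + ["role", "add", role],                                   # 4
        as_root + ["role", "grant", "--readwrite", "--path", path, role],  # 5
        as_root + ["user", "grant", "--roles", role, user],                # 6
    ]


if __name__ == "__main__":
    # Example values taken from the commands shown earlier in this section.
    for cmd in access_control_commands(
        "http://172.16.22.36:2379", "123", "huwh", "test1", "/test1"
    ):
        print(" ".join(cmd))
```

Returning argument lists, not shell strings, avoids quoting problems with paths such as /test1/*.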

5. API calls with authentication

More references

13.5 - Etcd startup configuration flags

1. Etcd configuration flags

/ # etcd --help
usage: etcd [flags]
       start an etcd server

       etcd --version
       show the version of etcd

       etcd -h | --help
       show the help information about etcd

       etcd --config-file
       path to the server configuration file

       etcd gateway
       run the stateless pass-through etcd TCP connection forwarding proxy

       etcd grpc-proxy
       run the stateless etcd v3 gRPC L7 reverse proxy

1.1. member flags

member flags:

	--name 'default'
		human-readable name for this member.
	--data-dir '${name}.etcd'
		path to the data directory.
	--wal-dir ''
		path to the dedicated wal directory.
	--snapshot-count '100000'
		number of committed transactions to trigger a snapshot to disk.
	--heartbeat-interval '100'
		time (in milliseconds) of a heartbeat interval.
	--election-timeout '1000'
		time (in milliseconds) for an election to timeout. See tuning documentation for details.
	--initial-election-tick-advance 'true'
		whether to fast-forward initial election ticks on boot for faster election.
	--listen-peer-urls 'http://localhost:2380'
		list of URLs to listen on for peer traffic.
	--listen-client-urls 'http://localhost:2379'
		list of URLs to listen on for client traffic.
	--max-snapshots '5'
		maximum number of snapshot files to retain (0 is unlimited).
	--max-wals '5'
		maximum number of wal files to retain (0 is unlimited).
	--cors ''
		comma-separated whitelist of origins for CORS (cross-origin resource sharing).
	--quota-backend-bytes '0'
		raise alarms when backend size exceeds the given quota (0 defaults to low space quota).
	--max-txn-ops '128'
		maximum number of operations permitted in a transaction.
	--max-request-bytes '1572864'
		maximum client request size in bytes the server will accept.
	--grpc-keepalive-min-time '5s'
		minimum duration interval that a client should wait before pinging server.
	--grpc-keepalive-interval '2h'
		frequency duration of server-to-client ping to check if a connection is alive (0 to disable).
	--grpc-keepalive-timeout '20s'
		additional duration of wait before closing a non-responsive connection (0 to disable).

1.2. clustering flags

clustering flags:

	--initial-advertise-peer-urls 'http://localhost:2380'
		list of this member's peer URLs to advertise to the rest of the cluster.
	--initial-cluster 'default=http://localhost:2380'
		initial cluster configuration for bootstrapping.
	--initial-cluster-state 'new'
		initial cluster state ('new' or 'existing').
	--initial-cluster-token 'etcd-cluster'
		initial cluster token for the etcd cluster during bootstrap.
		Specifying this can protect you from unintended cross-cluster interaction when running multiple clusters.
	--advertise-client-urls 'http://localhost:2379'
		list of this member's client URLs to advertise to the public.
		The client URLs advertised should be accessible to machines that talk to etcd cluster. etcd client libraries parse these URLs to connect to the cluster.
	--discovery ''
		discovery URL used to bootstrap the cluster.
	--discovery-fallback 'proxy'
		expected behavior ('exit' or 'proxy') when discovery services fails.
		"proxy" supports v2 API only.
	--discovery-proxy ''
		HTTP proxy to use for traffic to discovery service.
	--discovery-srv ''
		dns srv domain used to bootstrap the cluster.
	--strict-reconfig-check 'true'
		reject reconfiguration requests that would cause quorum loss.
	--auto-compaction-retention '0'
		auto compaction retention length. 0 means disable auto compaction.
	--auto-compaction-mode 'periodic'
		interpret 'auto-compaction-retention' one of: periodic|revision. 'periodic' for duration based retention, defaulting to hours if no time unit is provided (e.g. '5m'). 'revision' for revision number based retention.
	--enable-v2 'true'
		Accept etcd V2 client requests.
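
The --initial-cluster value is a comma-separated list of name=peer-URL pairs, as the default 'default=http://localhost:2380' shows. The following is a minimal Go sketch of how such a value could be assembled for a multi-member cluster. The helper name initialCluster and the member names and addresses are assumptions made for illustration; they are not part of etcd.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// initialCluster renders a value for --initial-cluster from a map of
// member name -> peer URL, in the "name=url,name=url" form.
// Names are sorted so that the output is deterministic.
func initialCluster(peers map[string]string) string {
	names := make([]string, 0, len(peers))
	for name := range peers {
		names = append(names, name)
	}
	sort.Strings(names)

	parts := make([]string, 0, len(names))
	for _, name := range names {
		parts = append(parts, name+"="+peers[name])
	}
	return strings.Join(parts, ",")
}

func main() {
	// Hypothetical member names and addresses, for illustration only.
	peers := map[string]string{
		"etcd-1": "http://10.0.0.1:2380",
		"etcd-2": "http://10.0.0.2:2380",
		"etcd-3": "http://10.0.0.3:2380",
	}
	fmt.Printf("--initial-cluster '%s' --initial-cluster-state 'new'\n", initialCluster(peers))
}
```

Every member of a new cluster would be started with the same --initial-cluster value, while --name and --initial-advertise-peer-urls differ per member.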

1.3. proxy flags

proxy flags:
	"proxy" supports v2 API only.

	--proxy 'off'
		proxy mode setting ('off', 'readonly' or 'on').
	--proxy-failure-wait 5000
		time (in milliseconds) an endpoint will be held in a failed state.
	--proxy-refresh-interval 30000
		time (in milliseconds) of the endpoints refresh interval.
	--proxy-dial-timeout 1000
		time (in milliseconds) for a dial to timeout.
	--proxy-write-timeout 5000
		time (in milliseconds) for a write to timeout.
	--proxy-read-timeout 0
		time (in milliseconds) for a read to timeout.

1.4. security flags

security flags:

	--ca-file '' [DEPRECATED]
		path to the client server TLS CA file. '-ca-file ca.crt' could be replaced by '-trusted-ca-file ca.crt -client-cert-auth' and etcd will perform the same.
	--cert-file ''
		path to the client server TLS cert file.
	--key-file ''
		path to the client server TLS key file.
	--client-cert-auth 'false'
		enable client cert authentication.
	--client-crl-file ''
		path to the client certificate revocation list file.
	--trusted-ca-file ''
		path to the client server TLS trusted CA cert file.
	--auto-tls 'false'
		client TLS using generated certificates.
	--peer-ca-file '' [DEPRECATED]
		path to the peer server TLS CA file. '-peer-ca-file ca.crt' could be replaced by '-peer-trusted-ca-file ca.crt -peer-client-cert-auth' and etcd will perform the same.
	--peer-cert-file ''
		path to the peer server TLS cert file.
	--peer-key-file ''
		path to the peer server TLS key file.
	--peer-client-cert-auth 'false'
		enable peer client cert authentication.
	--peer-trusted-ca-file ''
		path to the peer server TLS trusted CA file.
	--peer-auto-tls 'false'
		peer TLS using self-generated certificates if --peer-key-file and --peer-cert-file are not provided.
	--peer-crl-file ''
		path to the peer certificate revocation list file.

1.5. logging flags

logging flags

	--debug 'false'
		enable debug-level logging for etcd.
	--log-package-levels ''
		specify a particular log level for each etcd package (eg: 'etcdmain=CRITICAL,etcdserver=DEBUG').
	--log-output 'default'
		specify 'stdout' or 'stderr' to skip journald logging even when running under systemd.

1.6. unsafe flags

unsafe flags:

Please be CAUTIOUS when using unsafe flags because it will break the guarantees
given by the consensus protocol.

	--force-new-cluster 'false'
		force to create a new one-member cluster.

1.7. profiling flags

profiling flags:
	--enable-pprof 'false'
		Enable runtime profiling data via HTTP server. Address is at client URL + "/debug/pprof/"
	--metrics 'basic'
		Set level of detail for exported metrics, specify 'extensive' to include histogram metrics.
	--listen-metrics-urls ''
		List of URLs to listen on for metrics.

1.8. auth flags

auth flags:
	--auth-token 'simple'
		Specify a v3 authentication token type and its options ('simple' or 'jwt').

1.9. experimental flags

experimental flags:
	--experimental-initial-corrupt-check 'false'
		enable to check data corruption before serving any client/peer traffic.
	--experimental-corrupt-check-time '0s'
		duration of time between cluster corruption check passes.
	--experimental-enable-v2v3 ''
		serve v2 requests through the v3 backend under a given prefix.

13.6 - k8s data in etcd

1. Reading data keys

Use the following command to list all keys.

ETCDCTL_API=3 etcdctl --endpoints=<etcd-ip-1>:2379,<etcd-ip-2>:2379,<etcd-ip-3>:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt  --key=/etc/kubernetes/pki/apiserver-etcd-client.key  --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt get / --prefix --keys-only

Parameters:

  --cacert=""				verify certificates of TLS-enabled secure servers using this CA bundle
  --cert=""					identify secure client using this TLS certificate file
  --key=""					identify secure client using this TLS key file
  --endpoints=[127.0.0.1:2379]		gRPC endpoints

An alias can be used to shorten the long etcdctl command:

alias ectl='ETCDCTL_API=3 etcdctl --endpoints=<etcd-ip-1>:2379,<etcd-ip-2>:2379,<etcd-ip-3>:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt  --key=/etc/kubernetes/pki/apiserver-etcd-client.key  --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt'

2. Cluster data

2.1. node

/registry/minions/<node-ip-1>
/registry/minions/<node-ip-2>
/registry/minions/<node-ip-3>

Other information:

/registry/leases/kube-node-lease/<node-ip-1>
/registry/leases/kube-node-lease/<node-ip-2>
/registry/leases/kube-node-lease/<node-ip-3>

/registry/masterleases/<node-ip-2>
/registry/masterleases/<node-ip-3>

3. k8s object data

Format of k8s object data

3.1. namespace

/registry/namespaces/default
/registry/namespaces/game
/registry/namespaces/kube-node-lease
/registry/namespaces/kube-public
/registry/namespaces/kube-system

3.2. Namespace-scoped objects

/registry/{resource}/{namespace}/{resource_name}

Some common k8s objects as examples:

# deployment
/registry/deployments/default/game-2048
/registry/deployments/kube-system/prometheus-operator

# replicasets
/registry/replicasets/default/game-2048-c7d589ccf

# pod
/registry/pods/default/game-2048-c7d589ccf-8lsbw

# statefulsets
/registry/statefulsets/kube-system/prometheus-k8s

# daemonsets
/registry/daemonsets/kube-system/kube-proxy

# secrets
/registry/secrets/default/default-token-tbfmb

# serviceaccounts
/registry/serviceaccounts/default/default

service

# service
/registry/services/specs/default/game-2048

# endpoints
/registry/services/endpoints/default/game-2048
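
The key layout /registry/{resource}/{namespace}/{resource_name} can be split back into its parts. The following is a minimal Go sketch, assuming the key belongs to a namespace-scoped object; the type objectKey and the function parseObjectKey are hypothetical names used only here. The last two segments are taken as namespace and name, so two-part resource prefixes such as services/specs are covered as well.

```go
package main

import (
	"fmt"
	"strings"
)

// objectKey describes one namespace-scoped entry under /registry.
type objectKey struct {
	Resource  string // e.g. "pods" or "services/specs"
	Namespace string
	Name      string
}

// parseObjectKey splits a key of the form
// /registry/{resource}/{namespace}/{resource_name}.
// It assumes the key is namespace-scoped; keys with fewer than three
// segments after /registry/ (such as namespaces or nodes) are rejected.
func parseObjectKey(key string) (objectKey, error) {
	const prefix = "/registry/"
	if !strings.HasPrefix(key, prefix) {
		return objectKey{}, fmt.Errorf("key %q does not start with %s", key, prefix)
	}
	parts := strings.Split(strings.TrimPrefix(key, prefix), "/")
	if len(parts) < 3 {
		return objectKey{}, fmt.Errorf("key %q is not a namespace-scoped object key", key)
	}
	n := len(parts)
	return objectKey{
		Resource:  strings.Join(parts[:n-2], "/"),
		Namespace: parts[n-2],
		Name:      parts[n-1],
	}, nil
}

func main() {
	keys := []string{
		"/registry/pods/default/game-2048-c7d589ccf-8lsbw",
		"/registry/services/specs/default/game-2048",
		"/registry/namespaces/default", // cluster-scoped, reported as an error
	}
	for _, key := range keys {
		k, err := parseObjectKey(key)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Printf("resource=%s namespace=%s name=%s\n", k.Resource, k.Namespace, k.Name)
	}
}
```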

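4. Reading data values

By default k8s stores its data in etcd in protobuf format, so the values are not human-readable. With -w json, etcdctl additionally prints both the key and the value as base64-encoded strings.

alias ectl='ETCDCTL_API=3 etcdctl --endpoints=<etcd-ip-1>:2379,<etcd-ip-2>:2379,<etcd-ip-3>:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt --key=/etc/kubernetes/pki/apiserver-etcd-client.key --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt'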

# ectl get /registry/namespaces/test -w json |jq
{
  "header": {
    "cluster_id": 12113422651334595000,
    "member_id": 8381627376898157000,
    "revision": 12321629,
    "raft_term": 20
  },
  "kvs": [
    {
      "key": "L3JlZ2lzdHJ5L25hbWVzcGFjZXMvdGVzdA==",
      "create_revision": 11670741,
      "mod_revision": 11670741,
      "version": 1,
      "value": "azhzAAoPCgJ2MRIJTmFtZXNwYWNlElwKQgoEdGVzdBIAGgAiACokYWM1YmJjOTQtNTkxZi0xMWVhLWJiOTQtNmM5MmJmM2I3NmI1MgA4AEIICJuf3fIFEAB6ABIMCgprdWJlcm5ldGVzGggKBkFjdGl2ZRoAIgA="
    }
  ],
  "count": 1
}

The key can be recovered by base64-decoding it:

echo "L3JlZ2lzdHJ5L25hbWVzcGFjZXMvdGVzdA==" | base64 --decode

# output
/registry/namespaces/test
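
The same decoding can be done programmatically. The following is a minimal Go sketch that decodes the key from the JSON output and checks whether a value starts with the 4-byte prefix "k8s" plus a zero byte, which Kubernetes places in front of protobuf-encoded objects. The function names are hypothetical, and only the first eight base64 characters of the value shown earlier are used, which is enough to expose the prefix.

```go
package main

import (
	"bytes"
	"encoding/base64"
	"fmt"
)

// protobufMagic is the prefix written in front of protobuf-encoded
// Kubernetes objects: the bytes "k8s" followed by 0x00.
var protobufMagic = []byte("k8s\x00")

// decodeField decodes one base64 field ("key" or "value") taken from
// the `etcdctl -w json` output.
func decodeField(s string) ([]byte, error) {
	return base64.StdEncoding.DecodeString(s)
}

// isProtobufValue reports whether the raw value carries the protobuf prefix.
func isProtobufValue(raw []byte) bool {
	return bytes.HasPrefix(raw, protobufMagic)
}

func main() {
	key, err := decodeField("L3JlZ2lzdHJ5L25hbWVzcGFjZXMvdGVzdA==")
	if err != nil {
		panic(err)
	}
	fmt.Println(string(key))

	// First eight base64 characters of the "value" field shown above.
	head, err := decodeField("azhzAAoP")
	if err != nil {
		panic(err)
	}
	fmt.Printf("value starts with %q, protobuf=%v\n", head[:4], isProtobufValue(head))
}
```

The bytes after the prefix are a protobuf message, which is why a dedicated tool such as etcdhelper is needed to render the value as JSON.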

The value can be decoded with the etcdhelper tool, which has to be installed first.

alias ehelper='etcdhelper -key /etc/kubernetes/pki/apiserver-etcd-client.key -cert /etc/kubernetes/pki/apiserver-etcd-client.crt -cacert /etc/kubernetes/pki/etcd/ca.crt'

# ehelper get /registry/namespaces/test
/v1, Kind=Namespace
{
  "kind": "Namespace",
  "apiVersion": "v1",
  "metadata": {
    "name": "test",
    "uid": "ac5bbc94-591f-11ea-bb94-6c92bf3b76b5",
    "creationTimestamp": "2020-02-27T05:11:55Z"
  },
  "spec": {
    "finalizers": [
      "kubernetes"
    ]
  },
  "status": {
    "phase": "Active"
  }
}

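5. Notes

  • For performance reasons, k8s stores its etcd data in protobuf format by default. Do not modify or add k8s data in etcd by hand.
  • Storing etcd data in JSON format is not recommended. If JSON is required, set the kube-apiserver flag --storage-media-type=application/json. See: https://github.com/kubernetes/kubernetes/issues/44670

6. Shortcut commands

etcdctl commands need many authentication and endpoint parameters, so aliases can be used to shorten them.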

# etcdctl 
alias ectl='ETCDCTL_API=3 etcdctl --endpoints=<etcd-ip-1>:2379,<etcd-ip-2>:2379,<etcd-ip-3>:2379 --cacert=/etc/kubernetes/pki/etcd/ca.crt  --key=/etc/kubernetes/pki/apiserver-etcd-client.key  --cert=/etc/kubernetes/pki/apiserver-etcd-client.crt'

# etcdhelper
alias ehelper='etcdhelper -key /etc/kubernetes/pki/apiserver-etcd-client.key -cert /etc/kubernetes/pki/apiserver-etcd-client.crt -cacert /etc/kubernetes/pki/etcd/ca.crt'

6.1. Using etcdhelper

etcdhelper documentation: https://github.com/openshift/origin/tree/master/tools/etcdhelper

# Required authentication parameters
-key - points to master.etcd-client.key
-cert - points to master.etcd-client.crt
-cacert - points to ca.crt

# Command operations
ls - list all keys starting with prefix
get - get the specific value of a key
dump - dump the entire contents of the etcd

Example

$ ehelper ls /registry/leases/
/registry/leases/kube-node-lease/<ip-1>
/registry/leases/kube-node-lease/<ip-2>
/registry/leases/kube-node-lease/<ip-3>

$ ehelper get <key>

7. RBAC

The RBAC-related keys are listed below.

clusterrolebindings

/registry/clusterrolebindings/cluster-admin
/registry/clusterrolebindings/flannel
/registry/clusterrolebindings/galaxy
/registry/clusterrolebindings/helm
/registry/clusterrolebindings/kube-state-metrics
/registry/clusterrolebindings/kubeadm:kubelet-bootstrap
/registry/clusterrolebindings/kubeadm:node-autoapprove-bootstrap
/registry/clusterrolebindings/kubeadm:node-autoapprove-certificate-rotation
/registry/clusterrolebindings/kubeadm:node-proxier
/registry/clusterrolebindings/lbcf-controller
/registry/clusterrolebindings/prometheus-k8s
/registry/clusterrolebindings/prometheus-operator
/registry/clusterrolebindings/system:aws-cloud-provider
/registry/clusterrolebindings/system:basic-user
/registry/clusterrolebindings/system:controller:attachdetach-controller
/registry/clusterrolebindings/system:controller:certificate-controller
/registry/clusterrolebindings/system:controller:clusterrole-aggregation-controller
/registry/clusterrolebindings/system:controller:cronjob-controller
/registry/clusterrolebindings/system:controller:daemon-set-controller
/registry/clusterrolebindings/system:controller:deployment-controller
/registry/clusterrolebindings/system:controller:disruption-controller
/registry/clusterrolebindings/system:controller:endpoint-controller
/registry/clusterrolebindings/system:controller:expand-controller
/registry/clusterrolebindings/system:controller:generic-garbage-collector
/registry/clusterrolebindings/system:controller:horizontal-pod-autoscaler
/registry/clusterrolebindings/system:controller:job-controller
/registry/clusterrolebindings/system:controller:namespace-controller
/registry/clusterrolebindings/system:controller:node-controller
/registry/clusterrolebindings/system:controller:persistent-volume-binder
/registry/clusterrolebindings/system:controller:pod-garbage-collector
/registry/clusterrolebindings/system:controller:pv-protection-controller
/registry/clusterrolebindings/system:controller:pvc-protection-controller
/registry/clusterrolebindings/system:controller:replicaset-controller
/registry/clusterrolebindings/system:controller:replication-controller
/registry/clusterrolebindings/system:controller:resourcequota-controller
/registry/clusterrolebindings/system:controller:route-controller
/registry/clusterrolebindings/system:controller:service-account-controller
/registry/clusterrolebindings/system:controller:service-controller
/registry/clusterrolebindings/system:controller:statefulset-controller
/registry/clusterrolebindings/system:controller:ttl-controller
/registry/clusterrolebindings/system:coredns
/registry/clusterrolebindings/system:discovery
/registry/clusterrolebindings/system:kube-controller-manager
/registry/clusterrolebindings/system:kube-dns
/registry/clusterrolebindings/system:kube-scheduler
/registry/clusterrolebindings/system:node
/registry/clusterrolebindings/system:node-proxier
/registry/clusterrolebindings/system:public-info-viewer
/registry/clusterrolebindings/system:volume-scheduler

clusterroles

/registry/clusterroles/admin
/registry/clusterroles/cluster-admin
/registry/clusterroles/edit
/registry/clusterroles/flannel
/registry/clusterroles/kube-state-metrics
/registry/clusterroles/lbcf-controller
/registry/clusterroles/prometheus-k8s
/registry/clusterroles/prometheus-operator
/registry/clusterroles/system:aggregate-to-admin
/registry/clusterroles/system:aggregate-to-edit
/registry/clusterroles/system:aggregate-to-view
/registry/clusterroles/system:auth-delegator
/registry/clusterroles/system:aws-cloud-provider
/registry/clusterroles/system:basic-user
/registry/clusterroles/system:certificates.k8s.io:certificatesigningrequests:nodeclient
/registry/clusterroles/system:certificates.k8s.io:certificatesigningrequests:selfnodeclient
/registry/clusterroles/system:controller:attachdetach-controller
/registry/clusterroles/system:controller:certificate-controller
/registry/clusterroles/system:controller:clusterrole-aggregation-controller
/registry/clusterroles/system:controller:cronjob-controller
/registry/clusterroles/system:controller:daemon-set-controller
/registry/clusterroles/system:controller:deployment-controller
/registry/clusterroles/system:controller:disruption-controller
/registry/clusterroles/system:controller:endpoint-controller
/registry/clusterroles/system:controller:expand-controller
/registry/clusterroles/system:controller:generic-garbage-collector
/registry/clusterroles/system:controller:horizontal-pod-autoscaler
/registry/clusterroles/system:controller:job-controller
/registry/clusterroles/system:controller:namespace-controller
/registry/clusterroles/system:controller:node-controller
/registry/clusterroles/system:controller:persistent-volume-binder
/registry/clusterroles/system:controller:pod-garbage-collector
/registry/clusterroles/system:controller:pv-protection-controller
/registry/clusterroles/system:controller:pvc-protection-controller
/registry/clusterroles/system:controller:replicaset-controller
/registry/clusterroles/system:controller:replication-controller
/registry/clusterroles/system:controller:resourcequota-controller
/registry/clusterroles/system:controller:route-controller
/registry/clusterroles/system:controller:service-account-controller
/registry/clusterroles/system:controller:service-controller
/registry/clusterroles/system:controller:statefulset-controller
/registry/clusterroles/system:controller:ttl-controller
/registry/clusterroles/system:coredns
/registry/clusterroles/system:csi-external-attacher
/registry/clusterroles/system:csi-external-provisioner
/registry/clusterroles/system:discovery
/registry/clusterroles/system:heapster
/registry/clusterroles/system:kube-aggregator
/registry/clusterroles/system:kube-controller-manager
/registry/clusterroles/system:kube-dns
/registry/clusterroles/system:kube-scheduler
/registry/clusterroles/system:kubelet-api-admin
/registry/clusterroles/system:node
/registry/clusterroles/system:node-bootstrapper
/registry/clusterroles/system:node-problem-detector
/registry/clusterroles/system:node-proxier
/registry/clusterroles/system:persistent-volume-provisioner
/registry/clusterroles/system:public-info-viewer
/registry/clusterroles/system:volume-scheduler
/registry/clusterroles/view

rolebindings

/registry/rolebindings/kube-public/kubeadm:bootstrap-signer-clusterinfo
/registry/rolebindings/kube-public/system:controller:bootstrap-signer
/registry/rolebindings/kube-system/kube-proxy
/registry/rolebindings/kube-system/kube-state-metrics
/registry/rolebindings/kube-system/kubeadm:kubeadm-certs
/registry/rolebindings/kube-system/kubeadm:kubelet-config-1.14
/registry/rolebindings/kube-system/kubeadm:nodes-kubeadm-config
/registry/rolebindings/kube-system/system::extension-apiserver-authentication-reader
/registry/rolebindings/kube-system/system::leader-locking-kube-controller-manager
/registry/rolebindings/kube-system/system::leader-locking-kube-scheduler
/registry/rolebindings/kube-system/system:controller:bootstrap-signer
/registry/rolebindings/kube-system/system:controller:cloud-provider
/registry/rolebindings/kube-system/system:controller:token-cleaner

roles

/registry/roles/kube-public/kubeadm:bootstrap-signer-clusterinfo
/registry/roles/kube-public/system:controller:bootstrap-signer
/registry/roles/kube-system/extension-apiserver-authentication-reader
/registry/roles/kube-system/kube-proxy
/registry/roles/kube-system/kube-state-metrics-resizer
/registry/roles/kube-system/kubeadm:kubeadm-certs
/registry/roles/kube-system/kubeadm:kubelet-config-1.14
/registry/roles/kube-system/kubeadm:nodes-kubeadm-config
/registry/roles/kube-system/system::leader-locking-kube-controller-manager
/registry/roles/kube-system/system::leader-locking-kube-scheduler
/registry/roles/kube-system/system:controller:bootstrap-signer
/registry/roles/kube-system/system:controller:cloud-provider
/registry/roles/kube-system/system:controller:token-cleaner


13.7 - Using etcd-operator

This article describes how to deploy and use etcd-operator.

1. Deploying RBAC

Download create_role.sh, cluster-role-binding-template.yaml, and cluster-role-template.yaml.

For example:

|-- cluster-role-binding-template.yaml
|-- cluster-role-template.yaml
|-- create_role.sh

# Deploy RBAC
kubectl create ns operator
bash create_role.sh --namespace=operator  # the namespace must match the one etcd-operator runs in

Example:

bash create_role.sh --namespace=operator
+ ROLE_NAME=etcd-operator
+ ROLE_BINDING_NAME=etcd-operator
+ NAMESPACE=default
+ for i in '"$@"'
+ case $i in
+ NAMESPACE=operator
+ echo 'Creating role with ROLE_NAME=etcd-operator, NAMESPACE=operator'
Creating role with ROLE_NAME=etcd-operator, NAMESPACE=operator
+ sed -e 's/<ROLE_NAME>/etcd-operator/g' -e 's/<NAMESPACE>/operator/g' cluster-role-template.yaml
+ kubectl create -f -
clusterrole.rbac.authorization.k8s.io/etcd-operator created
+ echo 'Creating role binding with ROLE_NAME=etcd-operator, ROLE_BINDING_NAME=etcd-operator, NAMESPACE=operator'
Creating role binding with ROLE_NAME=etcd-operator, ROLE_BINDING_NAME=etcd-operator, NAMESPACE=operator
+ sed -e 's/<ROLE_NAME>/etcd-operator/g' -e 's/<ROLE_BINDING_NAME>/etcd-operator/g' -e 's/<NAMESPACE>/operator/g' cluster-role-binding-template.yaml
+ kubectl create -f -
clusterrolebinding.rbac.authorization.k8s.io/etcd-operator created

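1.1. The create_role.sh script

create_role.sh accepts three options: --role-name, --role-binding-name, and --namespace. The --namespace value should match the namespace in which etcd-operator is deployed; the default is default.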

#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

ETCD_OPERATOR_ROOT=$(dirname "${BASH_SOURCE}")/../..

print_usage() {
  echo "$(basename "$0") - Create Kubernetes RBAC role and role bindings for etcd-operator
Usage: $(basename "$0") [options...]
Options:
  --role-name=STRING         Name of ClusterRole to create
                               (default=\"etcd-operator\", environment variable: ROLE_NAME)
  --role-binding-name=STRING Name of ClusterRoleBinding to create
                               (default=\"etcd-operator\", environment variable: ROLE_BINDING_NAME)
  --namespace=STRING         namespace to create role and role binding in. Must already exist.
                               (default=\"default\", environment variable: NAMESPACE)
" >&2
}

ROLE_NAME="${ROLE_NAME:-etcd-operator}"
ROLE_BINDING_NAME="${ROLE_BINDING_NAME:-etcd-operator}"
NAMESPACE="${NAMESPACE:-default}"

for i in "$@"
do
case $i in
    --role-name=*)
    ROLE_NAME="${i#*=}"
    ;;
    --role-binding-name=*)
    ROLE_BINDING_NAME="${i#*=}"
    ;;
    --namespace=*)
    NAMESPACE="${i#*=}"
    ;;
    -h|--help)
      print_usage
      exit 0
    ;;
    *)
      print_usage
      exit 1
    ;;
esac
done

echo "Creating role with ROLE_NAME=${ROLE_NAME}, NAMESPACE=${NAMESPACE}"
sed -e "s/<ROLE_NAME>/${ROLE_NAME}/g" \
  -e "s/<NAMESPACE>/${NAMESPACE}/g" \
  "cluster-role-template.yaml" | \
  kubectl create -f -

echo "Creating role binding with ROLE_NAME=${ROLE_NAME}, ROLE_BINDING_NAME=${ROLE_BINDING_NAME}, NAMESPACE=${NAMESPACE}"
sed -e "s/<ROLE_NAME>/${ROLE_NAME}/g" \
  -e "s/<ROLE_BINDING_NAME>/${ROLE_BINDING_NAME}/g" \
  -e "s/<NAMESPACE>/${NAMESPACE}/g" \
  "cluster-role-binding-template.yaml" | \
  kubectl create -f -
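
The script only substitutes three placeholders in the templates before piping the result to kubectl. The following is a minimal Go sketch of that substitution step, mirroring the sed expressions; the function name render is hypothetical and the template is a trimmed fragment of cluster-role-binding-template.yaml.

```go
package main

import (
	"fmt"
	"strings"
)

// render replaces the <ROLE_NAME>, <ROLE_BINDING_NAME> and <NAMESPACE>
// placeholders, mirroring the sed expressions used in create_role.sh.
func render(tmpl, roleName, bindingName, namespace string) string {
	return strings.NewReplacer(
		"<ROLE_NAME>", roleName,
		"<ROLE_BINDING_NAME>", bindingName,
		"<NAMESPACE>", namespace,
	).Replace(tmpl)
}

func main() {
	// Trimmed-down fragment of cluster-role-binding-template.yaml.
	tmpl := `metadata:
  name: <ROLE_BINDING_NAME>
roleRef:
  kind: ClusterRole
  name: <ROLE_NAME>
subjects:
- kind: ServiceAccount
  name: default
  namespace: <NAMESPACE>
`
	fmt.Print(render(tmpl, "etcd-operator", "etcd-operator", "operator"))
}
```

Because the binding targets the default ServiceAccount of the given namespace, the operator Deployment has to run in that same namespace for the permissions to apply.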

1.2. cluster-role-binding-template.yaml

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRoleBinding
metadata:
  name: <ROLE_BINDING_NAME>
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: <ROLE_NAME>
subjects:
- kind: ServiceAccount
  name: default
  namespace: <NAMESPACE>

1.3. cluster-role-template.yaml

apiVersion: rbac.authorization.k8s.io/v1beta1
kind: ClusterRole
metadata:
  name: <ROLE_NAME>
rules:
- apiGroups:
  - etcd.database.coreos.com
  resources:
  - etcdclusters
  - etcdbackups
  - etcdrestores
  verbs:
  - "*"
- apiGroups:
  - apiextensions.k8s.io
  resources:
  - customresourcedefinitions
  verbs:
  - "*"
- apiGroups:
  - ""
  resources:
  - pods
  - services
  - endpoints
  - persistentvolumeclaims
  - events
  verbs:
  - "*"
- apiGroups:
  - apps
  resources:
  - deployments
  verbs:
  - "*"
# The following permissions can be removed if not using S3 backup and TLS
- apiGroups:
  - ""
  resources:
  - secrets
  verbs:
  - get

2. Deploying etcd-operator

kubectl create -f etcd-operator.yaml

etcd-operator.yaml is as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: etcd-operator
  namespace: operator   # must match the namespace specified for RBAC
  labels:
    app: etcd-operator
spec:
  replicas: 1
  selector:
    matchLabels:
      app: etcd-operator
  template:
    metadata:
      labels:
        app: etcd-operator
    spec:
      containers:
      - name: etcd-operator
        image: registry.cn-shenzhen.aliyuncs.com/huweihuang/etcd-operator:v0.9.4
        command:
        - etcd-operator
        # Uncomment to act for resources in all namespaces. More information in doc/user/clusterwide.md
        - -cluster-wide
        env:
        - name: MY_POD_NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
        - name: MY_POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name

Check the CRD

#kubectl get customresourcedefinitions
NAME                                       CREATED AT
etcdclusters.etcd.database.coreos.com      2020-08-01T13:02:18Z

Check the etcd-operator logs to confirm that it started correctly.

kubectl logs -f etcd-operator-545df8d445-qpf6n -n operator
time="2020-08-01T13:02:18Z" level=info msg="etcd-operator Version: 0.9.4"
time="2020-08-01T13:02:18Z" level=info msg="Git SHA: c8a1c64"
time="2020-08-01T13:02:18Z" level=info msg="Go Version: go1.11.5"
time="2020-08-01T13:02:18Z" level=info msg="Go OS/Arch: linux/amd64"
time="2020-08-01T13:02:18Z" level=info msg="Event(v1.ObjectReference{Kind:\"Endpoints\", Namespace:\"operator\", Name:\"etcd-operator\", UID:\"7de38cff-1b7b-4bf2-9837-473fa66c9366\", APIVersion:\"v1\", ResourceVersion:\"41195930\", FieldPath:\"\"}): type: 'Normal' reason: 'LeaderElection' etcd-operator-545df8d445-qpf6n became leader"

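The output above indicates that etcd-operator is running normally.

3. Deploying the etcd cluster

kubectl create -f etcd-cluster.yaml

When cluster-wide mode is enabled, the etcd cluster may be in a different namespace from etcd-operator.

etcd-cluster.yaml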

apiVersion: "etcd.database.coreos.com/v1beta2"
kind: "EtcdCluster"
metadata:
  name: "default-etcd-cluster"
  ## Adding this annotation make this cluster managed by clusterwide operators
  ## namespaced operators ignore it
  annotations:
    etcd.database.coreos.com/scope: clusterwide
  namespace: etcd   # the namespace in which the etcd cluster is deployed
spec:
  size: 3
  version: "v3.3.18"
  repository: registry.cn-shenzhen.aliyuncs.com/huweihuang/etcd
  pod:
    busyboxImage: registry.cn-shenzhen.aliyuncs.com/huweihuang/busybox:1.28.0-glibc

Check the result of the cluster deployment

$ kubectl get pods -n etcd
NAME                              READY   STATUS    RESTARTS   AGE
default-etcd-cluster-b6phnpf8z8   1/1     Running   0          3m3s
default-etcd-cluster-hhgq4sbtgr   1/1     Running   0          109s
default-etcd-cluster-ttfh5fj92b   1/1     Running   0          2m29s

4. Accessing the etcd cluster

Check the services

$ kubectl get svc -n etcd
NAME                          TYPE        CLUSTER-IP        EXTERNAL-IP   PORT(S)             AGE
default-etcd-cluster          ClusterIP   None              <none>        2379/TCP,2380/TCP   5m37s
default-etcd-cluster-client   ClusterIP   192.168.255.244   <none>        2379/TCP            5m37s

Access the cluster through the service address

# Check cluster health
$ ETCDCTL_API=3 etcdctl --endpoints 192.168.255.244:2379 endpoint health
192.168.255.244:2379 is healthy: successfully committed proposal: took = 1.96126ms

# Write data
$ ETCDCTL_API=3 etcdctl --endpoints 192.168.255.244:2379 put foo bar
OK

# Read data
$ ETCDCTL_API=3 etcdctl --endpoints 192.168.255.244:2379 get foo
foo
bar

5. Removing etcd-operator

kubectl delete -f etcd-operator.yaml
kubectl delete endpoints etcd-operator -n operator
kubectl delete crd etcdclusters.etcd.database.coreos.com
kubectl delete clusterrole etcd-operator
kubectl delete clusterrolebinding etcd-operator


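14 - Multi-cluster management

14.1 - Thoughts on k8s multi-cluster management

Thoughts on k8s multi-cluster management

1. Why multiple clusters are needed

1) A single k8s cluster has limited capacity.

Kubernetes v1.21 supports clusters with up to 5000 nodes. More specifically, Kubernetes is designed to accommodate configurations that meet all of the following criteria:

  • No more than 100 Pods per node
  • No more than 5000 nodes
  • No more than 150000 Pods in total
  • No more than 300000 containers in total

Reference: https://kubernetes.io/zh/docs/setup/best-practices/cluster-large/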
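
The four criteria can be checked mechanically when sizing a cluster. The following is a minimal Go sketch, using the limits from the list above; the type clusterPlan, the function violations, and the sample numbers are hypothetical and serve only as an illustration.

```go
package main

import "fmt"

// Limits taken from the criteria listed above (Kubernetes v1.21).
const (
	maxPodsPerNode = 100
	maxNodes       = 5000
	maxPods        = 150000
	maxContainers  = 300000
)

// clusterPlan is a hypothetical summary of a planned cluster.
type clusterPlan struct {
	Nodes       int
	PodsPerNode int
	Pods        int
	Containers  int
}

// violations returns one message per criterion that the plan exceeds.
// An empty result means the plan meets all four criteria.
func violations(p clusterPlan) []string {
	var out []string
	check := func(name string, got, limit int) {
		if got > limit {
			out = append(out, fmt.Sprintf("%s: %d exceeds %d", name, got, limit))
		}
	}
	check("pods per node", p.PodsPerNode, maxPodsPerNode)
	check("nodes", p.Nodes, maxNodes)
	check("total pods", p.Pods, maxPods)
	check("total containers", p.Containers, maxContainers)
	return out
}

func main() {
	// Sample numbers chosen for illustration only.
	plan := clusterPlan{Nodes: 6000, PodsPerNode: 30, Pods: 180000, Containers: 250000}
	for _, v := range violations(plan) {
		fmt.Println(v)
	}
}
```

A plan that exceeds any one criterion is a candidate for splitting into several clusters.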

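In addition, when the number of nodes is large, problems such as scheduling latency, etcd read/write latency, and high apiserver load appear and interfere with the normal creation of services.

2) Spreading service risk across clusters.

If all services run in a single k8s cluster and that cluster fails and cannot be recovered quickly, every service is affected and so are new deployments. To keep a failure such as a data-center outage from taking down the only cluster, deploy the k8s masters across different availability zones with low latency between them, and deploy multiple k8s clusters in different regions for cluster-level disaster recovery.

3) Current hybrid-cloud usage patterns and architecture.

Some companies currently operate a hybrid cloud that combines self-built data centers with public clouds from different vendors, which naturally brings in the problem of multi-cluster management.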

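2. Problems that multi-cluster deployment needs to solve

Goal: let users use multiple clusters the same way they use a single cluster.

The idea is to extend the boundary of the cluster. The boundary of a service grew from multiple processes on a single physical machine, to multiple physical machines managed by one k8s cluster, and then to multiple managed k8s clusters. In other words, the service boundary moved from the physical machine to the cluster.

Multi-cluster management needs to solve the following problems:

  • Distributing and deploying services across clusters (deployment, daemonset, etc.)
  • Automatic cross-cluster migration and scheduling (when one cluster fails, services can be deployed automatically in other clusters)
  • Multi-cluster service discovery, network communication, and load balancing (service, ingress, etc.)

Network communication between services in different clusters can be handled by a service mesh or similar, so it is not a focus of this article.

These problems can first be analyzed with the same mindset k8s uses to manage nodes.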

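| | Physical machine view | Single-cluster view | Multi-cluster view |
|---|---|---|---|
| Process boundary | Physical machine | k8s cluster | Multiple clusters |
| Scheduling unit | Process or thread | Container or pod | Workload (deployment) |
| Service collection | | Workload (deployment) | Collection of workloads across clusters (workloadGroup) |
| Service discovery | | service | Collection of services across clusters |
| Service migration | | Workload (deployment) controller | Controller for the collection of workloads across clusters |
| Service scheduling | | nodename or node selector | clustername or cluster selector |
| | | Pod anti-affinity (pods of the same deployment are not scheduled onto the same node) | Workload anti-affinity (workloads of the same workloadGroup are spread across different clusters) |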

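2.1. Distributing workloads across clusters

In a single cluster, the k8s scheduling unit is the pod: a pod can run on only one node, a node can run many pods, and a group of pods on different nodes is controlled and distributed by one workload. By analogy, from the multi-cluster point of view the scheduling unit is a workload within one cluster: a workload can run in only one cluster, and a cluster can run many workloads.

A controller is therefore needed to manage the same workload across different k8s clusters, for example a workloadGroup. Without intruding on the native k8s API, a workloadGroup consists of two main parts.

workloadGroup:

  • Resource Template: the description of the service (the workload)
  • Propagation Policy: the clusters the service is distributed to (that is, which clusters the workloads should run in)

A workload describes which service runs on which nodes; a workloadGroup describes which service runs in which clusters.

There are two ways to implement a workloadGroup:

  1. Define a custom API that combines the Resource Template and the Propagation Policy of the workloadGroup into one custom object. The user specifies the workloadGroup directly, and the workloads are distributed to different clusters accordingly.
  2. Use a k8s object as a carrier that records a concrete workload object, and let the user specify a Propagation Policy that references that workload object. The controller then distributes the workload to different clusters automatically according to that Propagation Policy.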
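
The first approach, one object holding both the Resource Template and the Propagation Policy, can be illustrated with a few types. The following is a minimal Go sketch; every type, field, and cluster name is hypothetical and does not correspond to the API of any existing project.

```go
package main

import "fmt"

// ResourceTemplate describes the service itself (the workload).
type ResourceTemplate struct {
	Kind  string // e.g. "Deployment"
	Name  string
	Image string
}

// PropagationPolicy lists the clusters the workload is distributed to.
type PropagationPolicy struct {
	Clusters []string
}

// WorkloadGroup combines both parts in a single object.
type WorkloadGroup struct {
	Template ResourceTemplate
	Policy   PropagationPolicy
}

// Placement is one copy of the workload bound to one cluster.
type Placement struct {
	Cluster  string
	Workload ResourceTemplate
}

// Expand produces one placement per target cluster, which is what a
// multi-cluster controller would then create in each member cluster.
func (g WorkloadGroup) Expand() []Placement {
	out := make([]Placement, 0, len(g.Policy.Clusters))
	for _, c := range g.Policy.Clusters {
		out = append(out, Placement{Cluster: c, Workload: g.Template})
	}
	return out
}

func main() {
	// Hypothetical workload and cluster names.
	g := WorkloadGroup{
		Template: ResourceTemplate{Kind: "Deployment", Name: "game-2048", Image: "game-2048:latest"},
		Policy:   PropagationPolicy{Clusters: []string{"cluster-a", "cluster-b"}},
	}
	for _, p := range g.Expand() {
		fmt.Printf("%s/%s -> %s\n", p.Workload.Kind, p.Workload.Name, p.Cluster)
	}
}
```

In the second approach the two structs would live in separate objects, with the policy referencing the template by name instead of embedding it.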

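2.2. Automatic cross-cluster migration and scheduling

In a single cluster, k8s uses the nodeSelector, nodeName, and affinity settings of a workload to control which node a pod runs on. From the multi-cluster point of view, a controller is needed that implements cluster-level scheduling logic, such as clustername, cluster selector, and cluster anti-affinity, to decide automatically which clusters the workloads of a workloadGroup are spread across.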
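
Cluster-level scheduling can be pictured as label matching plus spreading. The following is a minimal Go sketch, assuming member clusters carry labels the way nodes do; the type Cluster and the functions matches and schedule are hypothetical. Picking distinct clusters stands in for cluster anti-affinity.

```go
package main

import (
	"fmt"
	"sort"
)

// Cluster is a hypothetical member cluster with labels, analogous to a node.
type Cluster struct {
	Name   string
	Labels map[string]string
}

// matches reports whether every selector entry is present in labels
// with the same value (the cluster-level analogue of a node selector).
func matches(labels, selector map[string]string) bool {
	for k, want := range selector {
		if got, ok := labels[k]; !ok || got != want {
			return false
		}
	}
	return true
}

// schedule returns up to n distinct clusters that match the selector,
// sorted by name. Each workload goes to a different cluster, which
// models anti-affinity between workloads of one workloadGroup.
func schedule(clusters []Cluster, selector map[string]string, n int) []string {
	if n < 0 {
		n = 0
	}
	var names []string
	for _, c := range clusters {
		if matches(c.Labels, selector) {
			names = append(names, c.Name)
		}
	}
	sort.Strings(names)
	if len(names) > n {
		names = names[:n]
	}
	return names
}

func main() {
	// Hypothetical clusters and labels.
	clusters := []Cluster{
		{Name: "bj-1", Labels: map[string]string{"region": "beijing"}},
		{Name: "bj-2", Labels: map[string]string{"region": "beijing"}},
		{Name: "sh-1", Labels: map[string]string{"region": "shanghai"}},
	}
	fmt.Println(schedule(clusters, map[string]string{"region": "beijing"}, 2))
}
```

A real scheduler would also weigh capacity and health, and would re-run the selection when a cluster fails in order to migrate workloads.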

3. Current multi-cluster solutions

3.1. Kubefed[Federation v2]

Introduction

Basic idea

3.2. virtual kubelet

Introduction

Basic idea

3.3. Karmada

Introduction

Basic idea


14.2 - Virtual Kubelet

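14.2.1 - Introduction to Virtual Kubelet

1. Introduction

Virtual Kubelet is an implementation of the Kubernetes kubelet that acts as a virtual kubelet to connect a k8s cluster to the APIs of other platforms. This allows k8s nodes to be backed by other providers, such as serverless platforms (ACI, AWS Fargate) and IoT Edge.

In one sentence: Kubernetes API on top, programmable back.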

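2. Architecture diagram

3. Features

Virtual Kubelet provides a library for building custom k8s nodes.

The currently supported features are:

  • Creating, deleting, and updating pods
  • Container logs, exec, and metrics
  • Getting a pod, the pod list, and pod status
  • Node addresses, capacity, and daemon
  • Operating system
  • Custom virtual network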

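4. Providers

Virtual Kubelet offers a pluggable provider interface that lets developers supply their own implementation of what a traditional kubelet does. A custom provider can use its own configuration files and environment parameters.

A custom provider must:

  • Provide lifecycle management for pods, containers, and resources
  • Conform to the API provided by Virtual Kubelet
  • Not access the k8s apiserver directly; instead, define a callback mechanism for obtaining data such as configmaps and secrets

Open-source providers

5. Custom providers

Create a directory for the custom provider.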

git clone https://github.com/virtual-kubelet/virtual-kubelet
cd virtual-kubelet
mkdir providers/my-provider

5.1. PodLifecycleHandler

The following methods are called when k8s creates, updates, or deletes a pod.

type PodLifecycleHandler interface {
    // CreatePod takes a Kubernetes Pod and deploys it within the provider.
    CreatePod(ctx context.Context, pod *corev1.Pod) error

    // UpdatePod takes a Kubernetes Pod and updates it within the provider.
    UpdatePod(ctx context.Context, pod *corev1.Pod) error

    // DeletePod takes a Kubernetes Pod and deletes it from the provider.
    DeletePod(ctx context.Context, pod *corev1.Pod) error

    // GetPod retrieves a pod by name from the provider (can be cached).
    GetPod(ctx context.Context, namespace, name string) (*corev1.Pod, error)

    // GetPodStatus retrieves the status of a pod by name from the provider.
    GetPodStatus(ctx context.Context, namespace, name string) (*corev1.PodStatus, error)

    // GetPods retrieves a list of all pods running on the provider (can be cached).
    GetPods(context.Context) ([]*corev1.Pod, error)
}

PodLifecycleHandler is called by the PodController to manage the pods assigned to the node.

pc, _ := node.NewPodController(podControllerConfig) // <-- instantiates the pod controller
pc.Run(ctx) // <-- starts watching for pods to be scheduled on the node
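
To make the shape of a provider concrete, the following is a minimal Go sketch of an in-memory store with the create, update, delete, get, and list methods. It is not the virtual-kubelet API: Pod is a simplified stand-in for corev1.Pod so that the example stays self-contained, memoryProvider is a hypothetical name, and GetPodStatus is omitted.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// Pod is a simplified stand-in for corev1.Pod, used only in this sketch.
type Pod struct {
	Namespace string
	Name      string
	Phase     string
}

var errNotFound = errors.New("pod not found")

// memoryProvider is a hypothetical provider that keeps pods in memory.
type memoryProvider struct {
	mu   sync.RWMutex
	pods map[string]Pod
}

func newMemoryProvider() *memoryProvider {
	return &memoryProvider{pods: make(map[string]Pod)}
}

func podKey(namespace, name string) string { return namespace + "/" + name }

// CreatePod "deploys" the pod by storing a copy and marking it as running.
func (p *memoryProvider) CreatePod(ctx context.Context, pod *Pod) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	stored := *pod
	stored.Phase = "Running"
	p.pods[podKey(pod.Namespace, pod.Name)] = stored
	return nil
}

// UpdatePod replaces a stored pod and fails if it was never created.
func (p *memoryProvider) UpdatePod(ctx context.Context, pod *Pod) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	k := podKey(pod.Namespace, pod.Name)
	if _, ok := p.pods[k]; !ok {
		return errNotFound
	}
	p.pods[k] = *pod
	return nil
}

// DeletePod removes the pod and fails if it does not exist.
func (p *memoryProvider) DeletePod(ctx context.Context, pod *Pod) error {
	p.mu.Lock()
	defer p.mu.Unlock()
	k := podKey(pod.Namespace, pod.Name)
	if _, ok := p.pods[k]; !ok {
		return errNotFound
	}
	delete(p.pods, k)
	return nil
}

// GetPod returns a copy of the stored pod.
func (p *memoryProvider) GetPod(ctx context.Context, namespace, name string) (*Pod, error) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	pod, ok := p.pods[podKey(namespace, name)]
	if !ok {
		return nil, errNotFound
	}
	return &pod, nil
}

// GetPods returns copies of all stored pods.
func (p *memoryProvider) GetPods(ctx context.Context) ([]*Pod, error) {
	p.mu.RLock()
	defer p.mu.RUnlock()
	out := make([]*Pod, 0, len(p.pods))
	for _, pod := range p.pods {
		pod := pod // take a per-iteration copy before using its address
		out = append(out, &pod)
	}
	return out, nil
}

func main() {
	ctx := context.Background()
	p := newMemoryProvider()
	if err := p.CreatePod(ctx, &Pod{Namespace: "default", Name: "web"}); err != nil {
		panic(err)
	}
	pod, err := p.GetPod(ctx, "default", "web")
	fmt.Println(pod.Phase, err)
}
```

A real provider would forward these calls to the backing platform and would use the corev1 types in its method signatures.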

5.2. PodNotifier(optional)

PodNotifier is optional. The interface is used to notify Virtual Kubelet of pod status changes. If it is not implemented, Virtual Kubelet periodically checks the status of all pods.

type PodNotifier interface {
    // NotifyPods instructs the notifier to call the passed in function when
    // the pod status changes.
    //
    // NotifyPods should not block callers.
    NotifyPods(context.Context, func(*corev1.Pod))
}

5.3. NodeProvider

NodeProvider is used to notify Virtual Kubelet of changes in node status. Virtual Kubelet periodically checks the node status and updates k8s accordingly.

type NodeProvider interface {
    // Ping checks if the node is still active.
    // This is intended to be lightweight as it will be called periodically as a
    // heartbeat to keep the node marked as ready in Kubernetes.
    Ping(context.Context) error

    // NotifyNodeStatus is used to asynchronously monitor the node.
    // The passed in callback should be called any time there is a change to the
    // node's status.
    // This will generally trigger a call to the Kubernetes API server to update
    // the status.
    //
    // NotifyNodeStatus should not block callers.
    NotifyNodeStatus(ctx context.Context, cb func(*corev1.Node))
}

NodeProvider is called by the NodeController to manage the node object in Kubernetes.

nc, _ := node.NewNodeController(nodeProvider, nodeSpec) // <-- instantiate a node controller from a node provider and a kubernetes node spec
nc.Run(ctx) // <-- creates the node in kubernetes and starts up the controller

5.4. Testing

From the project root directory, run:

make test

5.5. Example code

References:

14.2.2 - Virtual Kubelet commands

virtual-kubelet --help

#./virtual-kubelet --help
virtual-kubelet implements the Kubelet interface with a pluggable
backend implementation allowing users to create kubernetes nodes without running the kubelet.
This allows users to schedule kubernetes workloads on nodes that aren't running Kubernetes.

Usage:
  virtual-kubelet [flags]
  virtual-kubelet [command]

Available Commands:
  help        Help about any command
  providers   Show the list of supported providers
  version     Show the version of the program

Flags:
      --cluster-domain string                 kubernetes cluster-domain (default is 'cluster.local') (default "cluster.local")
      --disable-taint                         disable the virtual-kubelet node taint
      --enable-node-lease                     use node leases (1.13) for node heartbeats
      --full-resync-period duration           how often to perform a full resync of pods between kubernetes and the provider (default 1m0s)
  -h, --help                                  help for virtual-kubelet
      --klog.alsologtostderr                  log to standard error as well as files
      --klog.log_backtrace_at traceLocation   when logging hits line file:N, emit a stack trace (default :0)
      --klog.log_dir string                   If non-empty, write log files in this directory
      --klog.log_file string                  If non-empty, use this log file
      --klog.log_file_max_size uint           Defines the maximum size a log file can grow to. Unit is megabytes. If the value is 0, the maximum file size is unlimited. (default 1800)
      --klog.logtostderr                      log to standard error instead of files (default true)
      --klog.skip_headers                     If true, avoid header prefixes in the log messages
      --klog.skip_log_headers                 If true, avoid headers when opening log files
      --klog.stderrthreshold severity         logs at or above this threshold go to stderr (default 2)
      --klog.v Level                          number for the log level verbosity
      --klog.vmodule moduleSpec               comma-separated list of pattern=N settings for file-filtered logging
      --kubeconfig string                     kube config file to use for connecting to the Kubernetes API server (default "/root/.kube/config")
      --log-level string                      set the log level, e.g. "debug", "info", "warn", "error" (default "info")
      --metrics-addr string                   address to listen for metrics/stats requests (default ":10255")
      --namespace string                      kubernetes namespace (default is 'all')
      --nodename string                       kubernetes node name (default "virtual-kubelet")
      --os string                             Operating System (Linux/Windows) (default "Linux")
      --pod-sync-workers int                  set the number of pod synchronization workers (default 10)
      --provider string                       cloud provider
      --provider-config string                cloud provider configuration file
      --startup-timeout duration              How long to wait for the virtual-kubelet to start
      --trace-exporter strings                sets the tracing exporter to use, available exporters: [jaeger ocagent]
      --trace-sample-rate string              set probability of tracing samples
      --trace-service-name string             sets the name of the service used to register with the trace exporter (default "virtual-kubelet")
      --trace-tag map                         add tags to include with traces in key=value form

Use "virtual-kubelet [command] --help" for more information about a command.

14.3 - Karmada

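14.3.1 - Introduction to Karmada

These notes are compiled from online resources for reference.

Overview

Karmada (Kubernetes Armada) is a multi-cluster management system built on Kubernetes-native APIs. In multi-cloud and hybrid-cloud scenarios, Karmada provides pluggable, fully automated management of multi-cluster applications, enabling centralized multi-cloud management, high availability, failure recovery, and traffic scheduling.

Features

  • Cross-cluster application management based on Kubernetes-native APIs, so users can quickly and easily migrate applications from a single cluster to multiple clusters.
  • Centralized operation and management of Kubernetes clusters.
  • Cross-cluster applications can scale automatically, fail over, and load-balance across multiple clusters.
  • Advanced scheduling policies: region, availability zone, cloud provider, and cluster affinity/anti-affinity.
  • Support for creating and distributing user-defined (CustomResourceDefinitions) resources.

Architecture

[Figure: Karmada architecture]

  • ETCD: stores the Karmada API objects.
  • Karmada Scheduler: provides advanced multi-cluster scheduling policies.
  • Karmada Controller Manager: contains multiple controllers. The controllers watch Karmada objects and talk to the API servers of member clusters to create Kubernetes objects there.
    • Cluster Controller: manages the lifecycle and objects of member clusters.
    • Policy Controller: watches PropagationPolicy objects, creates ResourceBinding objects, and configures the resource propagation policy.
    • Binding Controller: watches ResourceBinding objects and creates the Work objects that correspond to the resource manifests.
    • Execution Controller: watches Work objects and distributes the resources to member clusters.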

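Resource propagation flow

Basic concepts

  • Resource Template: Karmada uses Kubernetes-native API definitions as resource templates, which makes it easy to integrate with the Kubernetes ecosystem toolchain.
  • Propagation Policy: Karmada provides a standalone policy API for configuring how resources are propagated.
  • Override Policy: Karmada provides a standalone override API for cluster-specific configuration, for example using a different image in each cluster.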
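How the three concepts combine can be sketched in a few lines of Go. This is a conceptual model only and does not use Karmada's API: `Template`, `PropagationPolicy`, `OverridePolicy`, and `render` are simplified, hypothetical types and names, and the cluster names and image references are made up.

```go
package main

import "fmt"

// Template is a simplified stand-in for a Kubernetes-native resource (assumption).
type Template struct {
	Name  string
	Image string
}

// PropagationPolicy lists the member clusters that should receive the template.
type PropagationPolicy struct {
	Clusters []string
}

// OverridePolicy maps a cluster name to the image that cluster should use.
type OverridePolicy struct {
	ImageByCluster map[string]string
}

// render returns one copy of the template per target cluster and applies
// the per-cluster image override where one exists.
func render(t Template, pp PropagationPolicy, op OverridePolicy) map[string]Template {
	out := make(map[string]Template, len(pp.Clusters))
	for _, c := range pp.Clusters {
		obj := t // copy, so the shared template is never mutated
		if img, ok := op.ImageByCluster[c]; ok {
			obj.Image = img
		}
		out[c] = obj
	}
	return out
}

func main() {
	t := Template{Name: "nginx", Image: "nginx:1.25"}
	pp := PropagationPolicy{Clusters: []string{"member1", "member2"}}
	op := OverridePolicy{ImageByCluster: map[string]string{
		"member2": "registry.example.com/nginx:1.25",
	}}
	rendered := render(t, pp, op)
	for _, c := range pp.Clusters {
		fmt.Printf("%s -> %s\n", c, rendered[c].Image)
	}
}
```

The template stays cluster-agnostic, and every cluster-specific difference lives in the override policy.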

Karmada resource propagation flow diagram:

[Figure: Karmada resource propagation flow]

References:

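15 - Edge containers

15.1 - KubeEdge

15.1.1 - Introduction to KubeEdge

1. KubeEdge overview

KubeEdge is built on Kubernetes and extends the orchestration of containerized applications to edge hosts and edge devices. It provides network communication, application deployment, and metadata synchronization between the cloud and the edge. It also supports the MQTT protocol, which lets developers write custom logic for connecting edge devices at the edge.

2. Features

  • Edge computing: provides edge node autonomy and data processing on edge nodes.
  • Easy deployment: developers can build applications that use the HTTP or MQTT protocol and run them in the cloud and at the edge.
  • Kubernetes-native support: edge devices and edge nodes can be managed and monitored through Kubernetes.
  • Rich application types: applications such as machine learning, image recognition, and event processing can be deployed at the edge.

3. Components

3.1. Cloud side

  • CloudHub: a WebSocket server responsible for watching changes on the cloud side, caching, and sending messages to EdgeHub.

  • EdgeController: an extended Kubernetes controller that manages edge node and pod metadata and synchronizes edge node data. It is the communication bridge between the k8s apiserver and EdgeCore.

  • DeviceController: an extended Kubernetes controller that manages node devices and synchronizes device metadata and status between the cloud and the edge.

3.2. Edge side

  • EdgeHub: a WebSocket client responsible for exchanging information between the cloud and the edge, including syncing cloud-side resource changes to the edge and edge-side status changes to the cloud.
  • Edged: an agent that runs on edge nodes and manages containerized applications. It manages the pod lifecycle, similar to the kubelet.
  • EventBus: an MQTT client that interacts with MQTT servers and provides publish/subscribe capabilities.
  • ServiceBus: an HTTP client that interacts with HTTP servers. It gives cloud components HTTP client capabilities for reaching HTTP servers running at the edge.
  • DeviceTwin: stores device status and syncs it to the cloud. It also provides query interfaces for applications.
  • MetaManager: the message processor between edged and edgehub. It also stores metadata in and retrieves it from a lightweight database (SQLite).

4. Architecture diagram

[Figure: KubeEdge architecture]

References: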

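15.1.2 - KubeEdge source code analysis

15.1.2.1 - KubeEdge source code analysis: cloudcore

This analysis is based on KubeEdge v1.1.0.

This article walks through the basic flow of CloudCoreCommand in cloudcore. The implementation of the cloudhub, edgecontroller, and devicecontroller modules will be analyzed in separate articles.

Directory structure:

cloud/cmd/cloudcore

cloudcore
├── app
│   ├── options
│   │   └── options.go
│   └── server.go # NewCloudCoreCommand, registerModules
└── cloudcore.go # main function

cloudcore contains the following modules:

  • cloudhub
  • edgecontroller
  • devicecontroller

1. The main function

KubeEdge uses the cobra command framework, and its code style is similar to that of the Kubernetes source. The cmd directory holds the cobra command definitions and flag parsing, while the pkg directory holds the actual implementation.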

cloud/cmd/cloudcore/cloudcore.go

func main() {
	command := app.NewCloudCoreCommand()
	logs.InitLogs()
	defer logs.FlushLogs()

	if err := command.Execute(); err != nil {
		os.Exit(1)
	}
}

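2. NewCloudCoreCommand

NewCloudCoreCommand is the constructor of the cobra command. A constructor of this kind usually does three things:

  • Construct the options
  • Add the flags
  • Run the Run function (the core)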

cloud/cmd/cloudcore/app/server.go

func NewCloudCoreCommand() *cobra.Command {
	opts := options.NewCloudCoreOptions()
	cmd := &cobra.Command{
		Use: "cloudcore",
		Long: `CloudCore is the core cloud part of KubeEdge, which contains three modules: cloudhub,
edgecontroller, and devicecontroller. Cloudhub is a web server responsible for watching changes at the cloud side,
caching and sending messages to EdgeHub. EdgeController is an extended kubernetes controller which manages 
edge nodes and pods metadata so that the data can be targeted to a specific edge node. DeviceController is an extended 
kubernetes controller which manages devices so that the device metadata/status date can be synced between edge and cloud.`,
		Run: func(cmd *cobra.Command, args []string) {
			verflag.PrintAndExitIfRequested()
			flag.PrintFlags(cmd.Flags())

			// To help debugging, immediately log version
			klog.Infof("Version: %+v", version.Get())
			registerModules()
			// start all modules
			core.Run()
		},
	}
	fs := cmd.Flags()
	namedFs := opts.Flags()
	verflag.AddFlags(namedFs.FlagSet("global"))
	globalflag.AddGlobalFlags(namedFs.FlagSet("global"), cmd.Name())
	for _, f := range namedFs.FlagSets {
		fs.AddFlagSet(f)
	}

	usageFmt := "Usage:\n  %s\n"
	cols, _, _ := term.TerminalSize(cmd.OutOrStdout())
	cmd.SetUsageFunc(func(cmd *cobra.Command) error {
		fmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine())
		cliflag.PrintSections(cmd.OutOrStderr(), namedFs, cols)
		return nil
	})
	cmd.SetHelpFunc(func(cmd *cobra.Command, args []string) {
		fmt.Fprintf(cmd.OutOrStdout(), "%s\n\n"+usageFmt, cmd.Long, cmd.UseLine())
		cliflag.PrintSections(cmd.OutOrStdout(), namedFs, cols)
	})

	return cmd
}

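Core code:

// Construct the options
opts := options.NewCloudCoreOptions()
// Run the Run function
registerModules()
core.Run()
// Add the flags
fs.AddFlagSet(f)

3. registerModules

Most KubeEdge modules use Beehive, a messaging framework based on Go channels (to be analyzed in a separate article), so each module must be registered with the beehive framework before it starts.

The modules involved in cloudcore are: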

  • cloudhub
  • edgecontroller
  • devicecontroller

cloud/cmd/cloudcore/app/server.go

// registerModules register all the modules started in cloudcore
func registerModules() {
	cloudhub.Register()
	edgecontroller.Register()
	devicecontroller.Register()
}

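The following uses cloudhub as an example to illustrate the registration process.

The cloudHub struct mainly contains:

  • context: the context used to pass message context
  • stopChan: a Go channel used for communication

Modules in the beehive framework must implement the Module interface, so cloudhub implements it as well. Its core method is Start, which starts the module.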

vendor/github.com/kubeedge/beehive/pkg/core/module.go

// Module interface
type Module interface {
	Name() string
	Group() string
	Start(c *context.Context)
	Cleanup()
}

The cloudHub struct and its registration function are shown below.

cloud/pkg/cloudhub/cloudhub.go

type cloudHub struct {
	context  *context.Context
	stopChan chan bool
}

func Register() {
	core.Register(&cloudHub{})
}

The actual registration is done by core.Register. Registration simply stores the module struct in a map keyed by module name, for later use.

vendor/github.com/kubeedge/beehive/pkg/core/module.go

// Register register module
func Register(m Module) {
	if isModuleEnabled(m.Name()) {
		modules[m.Name()] = m // store the module struct in a map keyed by module name
		log.LOGGER.Info("module " + m.Name() + " registered")
	} else {
		disabledModules[m.Name()] = m
		log.LOGGER.Info("module " + m.Name() +
			" is not register, please check modules.yaml")
	}
}
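The registration pattern can be reduced to a small standalone sketch. It is not the beehive implementation: the `Module` interface here drops the context parameter, and `registry`, `newRegistry`, and `demoModule` are hypothetical names used only to show the map-keyed-by-name idea.

```go
package main

import (
	"fmt"
	"sort"
)

// Module is a reduced version of the interface above, without the beehive context.
type Module interface {
	Name() string
	Start()
}

// registry keeps enabled and disabled modules in two maps keyed by module name.
type registry struct {
	enabled  map[string]bool
	modules  map[string]Module
	disabled map[string]Module
}

func newRegistry(enabledNames ...string) *registry {
	r := &registry{
		enabled:  map[string]bool{},
		modules:  map[string]Module{},
		disabled: map[string]Module{},
	}
	for _, n := range enabledNames {
		r.enabled[n] = true
	}
	return r
}

// Register files the module under its name; modules that are not enabled are parked separately.
func (r *registry) Register(m Module) {
	if r.enabled[m.Name()] {
		r.modules[m.Name()] = m
		return
	}
	r.disabled[m.Name()] = m
}

// Names returns the names of the registered (enabled) modules in sorted order.
func (r *registry) Names() []string {
	names := make([]string, 0, len(r.modules))
	for n := range r.modules {
		names = append(names, n)
	}
	sort.Strings(names)
	return names
}

// demoModule is a hypothetical module used only for this example.
type demoModule struct{ name string }

func (d demoModule) Name() string { return d.name }
func (d demoModule) Start()       { fmt.Println("starting", d.name) }

func main() {
	r := newRegistry("cloudhub", "edgecontroller")
	r.Register(demoModule{"cloudhub"})
	r.Register(demoModule{"edgecontroller"})
	r.Register(demoModule{"devicecontroller"}) // not enabled, so it is parked
	for _, n := range r.Names() {
		r.modules[n].Start()
	}
}
```

Because registration only fills a map, modules can be registered in any order and started later in one pass.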

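4. core.Run

The Run function of CloudCoreCommand actually runs all modules registered in the beehive framework.

It consists of two parts:

  • Start all registered modules
  • Listen for signals and clean up gracefully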

vendor/github.com/kubeedge/beehive/pkg/core/core.go

//Run starts the modules and in the end does module cleanup
func Run() {
	//Address the module registration and start the core
	StartModules()
	// monitor system signal and shutdown gracefully
	GracefulShutdown()
}

5. StartModules

StartModules obtains the context and runs every registered module in its own goroutine. Start here is the Start method of the Module interface, and each module provides its own implementation.

coreContext := context.GetContext(context.MsgCtxTypeChannel)
go module.Start(coreContext)

The full implementation is as follows:

vendor/github.com/kubeedge/beehive/pkg/core/core.go

// StartModules starts modules that are registered
func StartModules() {
	coreContext := context.GetContext(context.MsgCtxTypeChannel)

	modules := GetModules()
	for name, module := range modules {
		//Init the module
		coreContext.AddModule(name)
		//Assemble typeChannels for send2Group
		coreContext.AddModuleGroup(name, module.Group())
		go module.Start(coreContext)
		log.LOGGER.Info("starting module " + name)
	}
}

6. GracefulShutdown

When one of the registered signals is received, the Cleanup method of each module is executed.

vendor/github.com/kubeedge/beehive/pkg/core/core.go

// GracefulShutdown is if it gets the special signals it does modules cleanup
func GracefulShutdown() {
	c := make(chan os.Signal)
	signal.Notify(c, syscall.SIGINT, syscall.SIGHUP, syscall.SIGTERM,
		syscall.SIGQUIT, syscall.SIGILL, syscall.SIGTRAP, syscall.SIGABRT)
	select {
	case s := <-c:
		log.LOGGER.Info("got os signal " + s.String())
		//Cleanup each modules
		modules := GetModules()
		for name, module := range modules {
			log.LOGGER.Info("Cleanup module " + name)
			module.Cleanup()
		}
	}
}

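References:

15.1.2.2 - KubeEdge source code analysis: edgecore

This analysis is based on KubeEdge v1.1.0.

This article walks through the basic flow of EdgeCoreCommand in edgecore. The implementation of modules such as edged, edgehub, and metamanager will be analyzed in separate articles.

Directory structure:

edgecore
├── app
│   ├── options
│   │   └── options.go
│   └── server.go  # NewEdgeCoreCommand, registerModules
└── edgecore.go  # main

edgecore contains the following modules:

  • edged
  • edgehub
  • metamanager
  • eventbus
  • servicebus
  • devicetwin
  • edgemesh

1. The main function

The main entry function also follows the cobra command framework.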

edge/cmd/edgecore/edgecore.go

func main() {
	command := app.NewEdgeCoreCommand()
	logs.InitLogs()
	defer logs.FlushLogs()

	if err := command.Execute(); err != nil {
		os.Exit(1)
	}
}

2. NewEdgeCoreCommand

Like NewCloudCoreCommand, NewEdgeCoreCommand constructs the corresponding cobra command struct.

edge/cmd/edgecore/app/server.go

// NewEdgeCoreCommand create edgecore cmd
func NewEdgeCoreCommand() *cobra.Command {
	opts := options.NewEdgeCoreOptions()
	cmd := &cobra.Command{
		Use: "edgecore",
		Long: `Edgecore is the core edge part of KubeEdge, which contains six modules: devicetwin, edged, 
edgehub, eventbus, metamanager, and servicebus. DeviceTwin is responsible for storing device status 
and syncing device status to the cloud. It also provides query interfaces for applications. Edged is an 
agent that runs on edge nodes and manages containerized applications and devices. Edgehub is a web socket 
client responsible for interacting with Cloud Service for the edge computing (like Edge Controller as in the KubeEdge 
Architecture). This includes syncing cloud-side resource updates to the edge, and reporting 
edge-side host and device status changes to the cloud. EventBus is a MQTT client to interact with MQTT 
servers (mosquito), offering publish and subscribe capabilities to other components. MetaManager 
is the message processor between edged and edgehub. It is also responsible for storing/retrieving metadata 
to/from a lightweight database (SQLite).ServiceBus is a HTTP client to interact with HTTP servers (REST), 
offering HTTP client capabilities to components of cloud to reach HTTP servers running at edge. `,
		Run: func(cmd *cobra.Command, args []string) {
			verflag.PrintAndExitIfRequested()
			flag.PrintFlags(cmd.Flags())

			// To help debugging, immediately log version
			klog.Infof("Version: %+v", version.Get())

			registerModules()
			// start all modules
			core.Run()
		},
	}
	fs := cmd.Flags()
	namedFs := opts.Flags()
	verflag.AddFlags(namedFs.FlagSet("global"))
	globalflag.AddGlobalFlags(namedFs.FlagSet("global"), cmd.Name())
	for _, f := range namedFs.FlagSets {
		fs.AddFlagSet(f)
	}

	usageFmt := "Usage:\n  %s\n"
	cols, _, _ := term.TerminalSize(cmd.OutOrStdout())
	cmd.SetUsageFunc(func(cmd *cobra.Command) error {
		fmt.Fprintf(cmd.OutOrStderr(), usageFmt, cmd.UseLine())
		cliflag.PrintSections(cmd.OutOrStderr(), namedFs, cols)
		return nil
	})
	cmd.SetHelpFunc(func(cmd *cobra.Command, args []string) {
		fmt.Fprintf(cmd.OutOrStdout(), "%s\n\n"+usageFmt, cmd.Long, cmd.UseLine())
		cliflag.PrintSections(cmd.OutOrStdout(), namedFs, cols)
	})

	return cmd
}

Core code:

opts := options.NewEdgeCoreOptions()
registerModules()
core.Run()

3. registerModules

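edgecore also uses the Beehive messaging framework, and each module is registered before it is started. See the analysis of registerModules in cloudcore for the registration flow, which is not repeated here. The modules registered here are the components of edgecore.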

edge/cmd/edgecore/app/server.go

// registerModules register all the modules started in edgecore
func registerModules() {
	devicetwin.Register()
	edged.Register()
	edgehub.Register()
	eventbus.Register()
	edgemesh.Register()
	metamanager.Register()
	servicebus.Register()
	test.Register()
	dbm.InitDBManager()
}

4. core.Run

core.Run follows the same logic as in cloudcore and is not analyzed again here.

vendor/github.com/kubeedge/beehive/pkg/core/core.go

//Run starts the modules and in the end does module cleanup
func Run() {
   //Address the module registration and start the core
   StartModules()
   // monitor system signal and shutdown gracefully
   GracefulShutdown()
}

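References:

15.2 - OpenYurt

15.2.1 - Deploying OpenYurt

This article describes how to deploy the OpenYurt components to a Kubernetes cluster.

1. Label cloud nodes and edge nodes

OpenYurt divides Kubernetes nodes into cloud nodes and edge nodes. Cloud nodes mainly run cloud-side workloads, and edge nodes run edge workloads. When a node loses its connection to the apiserver, only pods on nodes with edge autonomy enabled are not evicted. Nodes are distinguished by the openyurt.io/is-edge-worker label: false marks a cloud node and true marks an edge node.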

Cloud components:

  • yurt-controller-manager

  • yurt-tunnel-server

Edge components:

  • yurt-hub

  • yurt-tunnel-agent

1.1. The openyurt.io/is-edge-worker node label

# Cloud node: the value is false
kubectl label node us-west-1.192.168.0.87 openyurt.io/is-edge-worker=false

# Edge node: the value is true
kubectl label node us-west-1.192.168.0.88 openyurt.io/is-edge-worker=true
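A tool that consumes this label could classify nodes roughly as follows. This is an illustrative sketch, not OpenYurt code: `nodeRole` is a hypothetical helper, and treating an unparsable value like a missing label is a choice made for this example.

```go
package main

import (
	"fmt"
	"strconv"
)

// edgeWorkerLabel is the label key used in the commands above.
const edgeWorkerLabel = "openyurt.io/is-edge-worker"

// nodeRole classifies a node from its labels as "edge", "cloud", or "unlabeled".
func nodeRole(labels map[string]string) string {
	v, ok := labels[edgeWorkerLabel]
	if !ok {
		return "unlabeled"
	}
	isEdge, err := strconv.ParseBool(v)
	if err != nil {
		// Treat values that are not booleans like a missing label (sketch-only choice).
		return "unlabeled"
	}
	if isEdge {
		return "edge"
	}
	return "cloud"
}

func main() {
	fmt.Println(nodeRole(map[string]string{edgeWorkerLabel: "true"}))
	fmt.Println(nodeRole(map[string]string{edgeWorkerLabel: "false"}))
	fmt.Println(nodeRole(map[string]string{}))
}
```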

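1.2. Enable autonomy mode on edge nodes

kubectl annotate node us-west-1.192.168.0.88 node.beta.openyurt.io/autonomy=true

2. Deploy yurt-controller-manager (cloud)

yurt-controller-manager prevents pods on autonomous edge nodes from being evicted when the node loses contact with the apiserver.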

wget https://raw.githubusercontent.com/openyurtio/openyurt/master/config/setup/yurt-controller-manager.yaml
kubectl apply -f yurt-controller-manager.yaml

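The YAML files are located at https://github.com/openyurtio/openyurt/tree/master/config/setup

Disable the default nodelifecycle controller

The nodelifecycle controller decides whether to evict the pods on a node based on the node's status and the renewal time of its lease. For yurt-controller-manager to work properly, the eviction function of this built-in controller must be disabled.

vim /etc/kubernetes/manifests/kube-controller-manager.yaml
# Append ,-nodelifecycle after --controllers=*,bootstrapsigner,tokencleaner
# so that the flag becomes: --controllers=*,bootstrapsigner,tokencleaner,-nodelifecycle

# If kube-controller-manager is deployed as a static pod, it restarts automatically after the YAML file is modified.

3. Deploy Yurthub (edge)

Once yurt-controller-manager is up and running, deploy Yurthub as a static pod.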

  1. Create the global configuration for yurthub (RBAC and ConfigMap)
wget https://raw.githubusercontent.com/openyurtio/openyurt/master/config/setup/yurthub-cfg.yaml
kubectl apply -f yurthub-cfg.yaml
  2. Create yurthub as a static pod on the edge node
mkdir -p /etc/kubernetes/manifests/
cd /etc/kubernetes/manifests/
wget https://raw.githubusercontent.com/openyurtio/openyurt/master/config/setup/yurthub.yaml

# Get a bootstrap token
kubeadm token create

# Assume the apiserver address is 1.2.3.4:6443 and the bootstrap token is 07401b.f395accd246ae52d
sed -i 's|__kubernetes_master_address__|1.2.3.4:6443|;
s|__bootstrap_token__|07401b.f395accd246ae52d|' /etc/kubernetes/manifests/yurthub.yaml

4. Reset the kubelet

Reset the kubelet service so that it accesses the apiserver through yurthub. Create a new kubeconfig file for the kubelet service to use when accessing the apiserver.

mkdir -p /var/lib/openyurt
cat << EOF > /var/lib/openyurt/kubelet.conf
apiVersion: v1
clusters:
- cluster:
    server: http://127.0.0.1:10261
  name: default-cluster
contexts:
- context:
    cluster: default-cluster
    namespace: default
    user: default-auth
  name: default-context
current-context: default-context
kind: Config
preferences: {}
EOF

Modify /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

sed -i "s|KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=\/etc\/kubernetes\/bootstrap-kubelet.conf\ --kubeconfig=\/etc\/kubernetes\/kubelet.conf|KUBELET_KUBECONFIG_ARGS=--kubeconfig=\/var\/lib\/openyurt\/kubelet.conf|g" \
    /etc/systemd/system/kubelet.service.d/10-kubeadm.conf

Restart the kubelet service

systemctl daemon-reload && systemctl restart kubelet

5. Deploy Yurt-tunnel (optional)

5.1. Deploy yurt-tunnel-server on the cloud side

wget https://raw.githubusercontent.com/openyurtio/openyurt/master/config/setup/yurt-tunnel-server.yaml
kubectl apply -f yurt-tunnel-server.yaml

5.2. Deploy yurt-tunnel-agent on the edge side

wget https://raw.githubusercontent.com/openyurtio/openyurt/master/config/setup/yurt-tunnel-agent.yaml
kubectl apply -f yurt-tunnel-agent.yaml

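Because yurt-tunnel-server uses host network mode by default, the agent on the edge may be unable to reach the tunnel-server in the cloud. In that case, configure a reachable address for the tunnel-server.

References: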

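16 - Virtualization

16.1 - Virtualization concepts

1. Virtualization

With virtualization technology, users can create multiple simulated environments or dedicated resources from a single physical hardware system. Software called a hypervisor (virtual machine monitor) connects directly to the hardware and splits one system into separate, distinct, and secure environments known as virtual machines (VMs).

Virtualization repartitions IT resources and improves resource utilization.

2. Types of virtualization

Full virtualization

Full virtualization uses an unmodified guest operating system. The guest communicates directly with the CPU, which makes it the fastest virtualization method.

Paravirtualization

Paravirtualization uses a modified guest operating system. The guest communicates with the hypervisor, which passes the guest's calls to the CPU and other interfaces. Because communication goes through the hypervisor, it is slower than full virtualization.

3. hypervisor

A hypervisor, also called a virtual machine monitor (VMM), is a program that creates and runs virtual machines. The computer on which a hypervisor runs one or more virtual machines is called the host machine, and each virtual machine is called a guest machine.

4. kvm

KVM (Kernel-based Virtual Machine) is the virtualization module of the Linux kernel. It allows the Linux kernel to act as a hypervisor.

KVM itself does not perform emulation. Instead, it exposes the /dev/kvm interface.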

使用KVM,可以在Linux的镜像上

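5. qemu

QEMU (Quick Emulator)

To be added.

6. libvirt

libvirt is a management tool and API for hardware virtualization. It can be used with virtualization technologies such as KVM and QEMU.

References: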

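16.2 - KubeVirt

16.2.1 - Introduction to KubeVirt

This article is mainly adapted from the article "云原生虚拟化:基于 Kubevirt 构建边缘计算实例" (Cloud-native virtualization: building edge computing instances on KubeVirt).

1. KubeVirt overview

KubeVirt is built on Kubernetes and provides a way to orchestrate and manage virtual machines through Kubernetes.

2. Architecture diagram

[Figure: KubeVirt architecture]

2.1. Components

Category       Component        Deployment        Description
Control plane  virt-api         deployment        Custom API for operations such as power on, power off, and restart. It acts as an extension of the apiserver: clients reach virt-api through the k8s apiserver.
Control plane  virt-controller  deployment        Manages and monitors the state of VMI objects and controls the pods that belong to each VMI.
Node           virt-handler     daemonset         Similar to the kubelet. Manages all virtual machine instances on the host.
Node           virt-launcher    virt-handler pod  Calls libvirt and qemu to create the virtual machine process.

virt-launcher and libvirt logic:

2.2. Custom CRD objects

Category  CRD object                    Description
VM        VirtualMachineInstance (VMI)  Represents a running virtual machine instance.
VM        VirtualMachine (VM)           The virtual machine object. Provides power on, power off, and restart, and manages the VMI instance. Its relationship to the VMI is 1:1.

3. Virtual machine creation flow

To be added.

References: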

16.2.2 - Using KubeVirt

1. Install KubeVirt

1.1. Change the image registry

In a private environment, the required images must be uploaded to your own image registry.

The image components involved are:

virt-operator
virt-api
virt-controller
virt-launcher

The script for re-tagging the images is as follows:

#!/bin/bash

# KubeVirt component version
version=$1

# Private image registry
registry=$2

# Namespace in the private image registry
namespace=$3

kubevirtRegistry="quay.io/kubevirt"

readonly APPLIST=(
    virt-operator
    virt-api
    virt-controller
    virt-launcher
)

for app in "${APPLIST[@]}"; do
    # Pull the image
    docker pull "${kubevirtRegistry}/${app}:${version}"
    # Re-tag it for the private registry
    docker tag "${kubevirtRegistry}/${app}:${version}" "${registry}/${namespace}/${app}:${version}"
    # Push the image
    docker push "${registry}/${namespace}/${app}:${version}"
done

echo "Images re-tagged and pushed"

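1.2. Deploy virt-operator

Install the KubeVirt components through the KubeVirt operator: pick a version, download kubevirt-operator.yaml and kubevirt-cr.yaml, and create the corresponding Kubernetes objects.

If you use a private image registry, replace the image names in kubevirt-operator.yaml with the address of the private registry, and push the required images to it beforehand as described in section 1.1.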

# Pick an upstream version of KubeVirt to install
$ export RELEASE=v0.52.0
# Deploy the KubeVirt operator
$ kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-operator.yaml
# Create the KubeVirt CR (instance deployment request) which triggers the actual installation
$ kubectl apply -f https://github.com/kubevirt/kubevirt/releases/download/${RELEASE}/kubevirt-cr.yaml
# wait until all KubeVirt components are up
$ kubectl -n kubevirt wait kv kubevirt --for condition=Available

1.3. Install virtctl

virtctl is used to start and stop virtual machines.

VERSION=$(kubectl get kubevirt.kubevirt.io/kubevirt -n kubevirt -o=jsonpath="{.status.observedKubeVirtVersion}")
ARCH=$(uname -s | tr A-Z a-z)-$(uname -m | sed 's/x86_64/amd64/') || windows-amd64.exe
echo ${ARCH}
curl -L -o virtctl https://github.com/kubevirt/kubevirt/releases/download/${VERSION}/virtctl-${VERSION}-${ARCH}
chmod +x virtctl
sudo install virtctl /usr/local/bin

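2. What the KubeVirt deployment creates

After virt-operator is deployed manually, it automatically deploys the following components:

Component        Deployment  Replicas
virt-api         deployment  2
virt-controller  deployment  2
virt-handler     daemonset   -

For details:

# kubectl get all -n kubevirt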
NAME                                   READY   STATUS    RESTARTS   AGE
pod/virt-api-5fb5cffb7f-hgjjh          1/1     Running   0          23h
pod/virt-api-5fb5cffb7f-jcp7x          1/1     Running   0          23h
pod/virt-controller-844cd4f58c-h8vsx   1/1     Running   0          23h
pod/virt-controller-844cd4f58c-vlxqs   1/1     Running   0          23h
pod/virt-handler-lb5ft                 1/1     Running   0          23h
pod/virt-handler-mtr4d                 1/1     Running   0          22h
pod/virt-handler-sxd2t                 1/1     Running   0          23h
pod/virt-operator-8595f577cd-b9txg     1/1     Running   0          23h
pod/virt-operator-8595f577cd-p2f69     1/1     Running   0          23h

NAME                                  TYPE        CLUSTER-IP       EXTERNAL-IP   PORT(S)   AGE
service/kubevirt-operator-webhook     ClusterIP   10.254.159.81    <none>        443/TCP   23h
service/kubevirt-prometheus-metrics   ClusterIP   10.254.7.231     <none>        443/TCP   23h
service/virt-api                      ClusterIP   10.254.244.139   <none>        443/TCP   23h

NAME                          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/virt-handler   3         3         3       3            3           kubernetes.io/os=linux   23h

NAME                              READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/virt-api          2/2     2            2           23h
deployment.apps/virt-controller   2/2     2            2           23h
deployment.apps/virt-operator     2/2     2            2           23h

NAME                                         DESIRED   CURRENT   READY   AGE
replicaset.apps/virt-api-5fb5cffb7f          2         2         2       23h
replicaset.apps/virt-controller-844cd4f58c   2         2         2       23h
replicaset.apps/virt-operator-8595f577cd     2         2         2       23h

NAME                            AGE   PHASE
kubevirt.kubevirt.io/kubevirt   23h   Deployed

3. Create a virtual machine

Create a virtual machine from vm.yaml

# Download vm.yaml
wget https://kubevirt.io/labs/manifests/vm.yaml
# Create the virtual machine
kubectl apply -f https://kubevirt.io/labs/manifests/vm.yaml

The vm.yaml file

apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: testvm
spec:
  running: false
  template:
    metadata:
      labels:
        kubevirt.io/size: small
        kubevirt.io/domain: testvm
    spec:
      domain:
        devices:
          disks:
            - name: containerdisk
              disk:
                bus: virtio
            - name: cloudinitdisk
              disk:
                bus: virtio
          interfaces:
          - name: default
            masquerade: {}
        resources:
          requests:
            memory: 64M
      networks:
      - name: default
        pod: {}
      volumes:
        - name: containerdisk
          containerDisk:
            image: quay.io/kubevirt/cirros-container-disk-demo
        - name: cloudinitdisk
          cloudInitNoCloud:
            userDataBase64: SGkuXG4=

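View the virtual machine

kubectl get vms
kubectl get vms -o yaml testvm

Start or stop the virtual machine

# Start the virtual machine
virtctl start testvm
# Stop the virtual machine
virtctl stop testvm
# Connect to the virtual machine console
virtctl console testvm

Delete the virtual machine

kubectl delete vm testvm

References: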

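17 - Monitoring

17.1 - Kubernetes cluster monitoring

1. Overview

1.1. cAdvisor

cAdvisor monitors the resources and containers on a Node in real time and collects performance data, including CPU usage, memory usage, network throughput, and filesystem usage. cAdvisor is integrated into the kubelet and starts automatically when the kubelet starts, so each cAdvisor instance monitors only one Node. The kubelet startup flag --cadvisor-port sets the port on which cAdvisor serves requests (default 4194). It can be accessed from a browser at Node_IP:port. Project home page: http://github.com/google/cadvisor.

1.2. Heapster

Heapster collects resource usage data for every Node and Pod in the cluster. It calls the kubelet API on each Node, and the kubelet in turn calls the cAdvisor API to collect performance data for all containers on that node. Heapster aggregates the data and saves it to a backend storage system such as InfluxDB or Google Cloud Logging. Project home page: https://github.com/kubernetes/heapster.

1.3. InfluxDB

InfluxDB is a distributed time-series database (every record carries a timestamp). It is mainly used for real-time data collection, event tracking, and storing time charts and raw data. It provides a REST API for storing and querying data. Project home page: http://InfluxDB.com.

1.4. Grafana

Grafana presents the time-series data in InfluxDB as charts on dashboards, which makes it easy to check the running state of the cluster. Project home page: http://Grafana.org.

1.5. Overall architecture

[Figure: Kubernetes monitoring architecture]

In current Kubernetes, Heapster, InfluxDB, and Grafana all start and run as Pods. A secure connection must be configured between Heapster and the Master.

2. Deployment and usage

2.1. cAdvisor

The kubelet startup flag --cadvisor-port sets the port on which cAdvisor serves requests (default 4194). It can be accessed from a browser at Node_IP:port. cAdvisor also provides a REST API for remote clients. The API returns JSON and is reached at URLs of the form http://hostname:port/api/version/request/

For example, http://14.152.49.100:4194/api/v1.3/machine returns information about the host.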
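A client for this URL scheme might look like the sketch below. It only illustrates the `http://hostname:port/api/version/request` pattern: `apiURL` and `fetchJSON` are hypothetical helpers, and `main` talks to a local test server whose `num_cores` payload is made up for the example rather than captured from a real cAdvisor.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// apiURL joins the base address, API version, and request name into
// a URL of the form <base>/api/<version>/<request>.
func apiURL(base, version, request string) string {
	return strings.TrimRight(base, "/") + "/api/" + version + "/" + request
}

// fetchJSON performs a GET request and decodes the JSON object in the response body.
func fetchJSON(url string) (map[string]interface{}, error) {
	resp, err := http.Get(url)
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	var out map[string]interface{}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return nil, err
	}
	return out, nil
}

func main() {
	// Local stand-in for a cAdvisor endpoint; the payload is illustrative only.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/api/v1.3/machine" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, `{"num_cores": 4}`)
	}))
	defer srv.Close()

	info, err := fetchJSON(apiURL(srv.URL, "v1.3", "machine"))
	if err != nil {
		fmt.Println("request failed:", err)
		return
	}
	fmt.Println("num_cores:", info["num_cores"])
}
```

Against a real node, the base address would be the Node IP and the cAdvisor port instead of the test server URL.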

2.2. Service

2.2.1. heapster-service

heapster-service.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: Heapster
  name: heapster
  namespace: kube-system
spec:
  ports:
    - port: 80
      targetPort: 8082
  selector:
    k8s-app: heapster

2.2.2. influxdb-service

influxdb-service.yaml

apiVersion: v1
kind: Service
metadata:
  labels: null
  name: monitoring-influxdb
  namespace: kube-system
spec:
  type: NodePort
  ports:
    - name: http
      port: 80
      targetPort: 8083
    - name: api
      port: 8086
      targetPort: 8086
      nodePort: 8086
  selector:
    name: influxGrafana

2.2.3. grafana-service

grafana-service.yaml

apiVersion: v1
kind: Service
metadata:
  labels:
    kubernetes.io/cluster-service: "true"
    kubernetes.io/name: monitoring-Grafana
  name: monitoring-grafana
  namespace: kube-system
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: 8080
      nodePort: 8085
  selector:
    name: influxGrafana

type: NodePort exposes InfluxDB and Grafana on a port of each Node so that they can be accessed from a browser.

2.2.4. Create the Services

kubectl create -f heapster-service.yaml
kubectl create -f influxdb-service.yaml
kubectl create -f grafana-service.yaml

2.3. ReplicationController

2.3.1. influxdb-grafana-controller

influxdb-grafana-controller-v3.yaml

apiVersion: v1
kind: ReplicationController
metadata:
  name: monitoring-influxdb-grafana-v3
  namespace: kube-system
  labels:
    k8s-app: influxGrafana
    version: v3
    kubernetes.io/cluster-service: "true"
spec:
  replicas: 1
  selector:
    k8s-app: influxGrafana
    version: v3
  template:
    metadata:
      labels:
        k8s-app: influxGrafana
        version: v3
        kubernetes.io/cluster-service: "true"
    spec:
      containers:
        - image: gcr.io/google_containers/heapster_influxdb:v0.5
          name: influxdb
          resources:
            limits:
              cpu: 100m
              memory: 500Mi
            requests:
              cpu: 100m
              memory: 500Mi
          ports:
            - containerPort: 8083
            - containerPort: 8086
          volumeMounts:
            - name: influxdb-persistent-storage
              mountPath: /data
        - image: gcr.io/google_containers/heapster_grafana:v2.6.0-2
          name: grafana
          resources:
            limits:
              cpu: 100m
              memory: 100Mi
            requests:
              cpu: 100m
              memory: 100Mi
          env:
            - name: INFLUXDB_SERVICE_URL
              value: http://monitoring-influxdb:8086
            - name: GF_AUTH_BASIC_ENABLED
              value: "false"
            - name: GF_AUTH_ANONYMOUS_ENABLED
              value: "true"
            - name: GF_AUTH_ANONYMOUS_ORG_ROLE
              value: Admin
            - name: GF_SERVER_ROOT_URL
              value: /api/v1/proxy/namespaces/kube-system/services/monitoring-grafana/
          volumeMounts:
            - name: grafana-persistent-storage
              mountPath: /var
      volumes:
        - name: influxdb-persistent-storage
          emptyDir: {}
        - name: grafana-persistent-storage
          emptyDir: {}

2.3.2. heapster-controller

heapster-controller.yaml

apiVersion: v1
kind: ReplicationController
metadata:
    labels:
        k8s-app: heapster
        name: heapster
        version: v6
    name: heapster
    namespace: kube-system
spec:
    replicas: 1
    selector:
        name: heapster
        k8s-app: heapster
        version: v6
    template:
        metadata:
            labels:
                name: heapster
                k8s-app: heapster
                version: v6
        spec:
            containers:
                - image: gcr.io/google_containers/heapster:v0.17.0
                  name: heapster
                  command:
                    - /heapster
                    - --source=kubernetes:http://192.168.1.128:8080?inClusterConfig=false&kubeletHttps=true&useServiceAccount=true&auth=
                    - --sink=influxdb:http://monitoring-influxdb:8086

Heapster startup flags:

1. --source

Configures the monitoring source. In this example, information about each Node is obtained from the Kubernetes Master. Adjust the values of kubeletHttps, inClusterConfig, and useServiceAccount in the query part of the URL.

2. --sink

Configures the backend storage system, which is InfluxDB in this example. The host name in the URL is the name of the InfluxDB Service, which requires a working DNS service. If DNS is not configured, use the ClusterIP address of the Service instead.
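The --source value is a URL with options in its query string, so it can be assembled programmatically. The following is a small sketch using only the Go standard library: `heapsterSource` is a hypothetical helper, and the master address and option values are the ones from the example above.

```go
package main

import (
	"fmt"
	"net/url"
)

// heapsterSource builds the value of the --source flag from the master URL
// and a set of query options.
func heapsterSource(master string, opts map[string]string) string {
	q := url.Values{}
	for k, v := range opts {
		q.Set(k, v)
	}
	// Encode sorts the keys, so the result is deterministic.
	return "kubernetes:" + master + "?" + q.Encode()
}

func main() {
	src := heapsterSource("http://192.168.1.128:8080", map[string]string{
		"inClusterConfig":   "false",
		"kubeletHttps":      "true",
		"useServiceAccount": "true",
	})
	fmt.Println("--source=" + src)
}
```

Building the query with `url.Values` also escapes option values correctly if they ever contain special characters.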

2.3.3. Create the ReplicationControllers

kubectl create -f influxdb-grafana-controller-v3.yaml
kubectl create -f heapster-controller.yaml

3. View the UI and data

3.1. InfluxDB

Access port 30083 on any Node.

3.2. Grafana

Access port 30080 on any Node.

4. Containerized deployment

4.1. Pull the images

docker pull influxdb:latest
docker pull cadvisor:latest
docker pull grafana:latest
docker pull heapster:latest

4.2. Run the containers

4.2.1. influxdb

#influxdb
docker run -d -p 8083:8083 -p 8086:8086 --expose 8090 --expose 8099 --volume=/opt/data/influxdb:/data --name influxsrv influxdb:latest

4.2.2. cadvisor

#cadvisor
docker run --volume=/:/rootfs:ro --volume=/var/run:/var/run:rw --volume=/sys:/sys:ro --volume=/var/lib/docker/:/var/lib/docker:ro --publish=8080:8080 --detach=true --link influxsrv:influxsrv --name=cadvisor cadvisor:latest -storage_driver=influxdb -storage_driver_db=cadvisor -storage_driver_host=influxsrv:8086

4.2.3. grafana

#grafana
docker run -d -p 3000:3000 -e INFLUXDB_HOST=influxsrv -e INFLUXDB_PORT=8086 -e INFLUXDB_NAME=cadvisor -e INFLUXDB_USER=root -e INFLUXDB_PASS=root --link influxsrv:influxsrv --name grafana grafana:latest

4.2.4. heapster

docker run -d -p 8082:8082 --net=host heapster:canary --source=kubernetes:http://`k8s-server-ip`:8080?inClusterConfig=false/&useServiceAccount=false --sink=influxdb:http://`influxdb-ip`:8086

4.3. Access

Enter IP:PORT in a browser.

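17.2 - Introduction to cAdvisor

1. cAdvisor overview

cAdvisor monitors the resources and containers on a Node in real time and collects performance data, including CPU usage, memory usage, network throughput, and filesystem usage. cAdvisor is integrated into the kubelet and starts automatically when the kubelet starts, so each cAdvisor instance monitors only one Node. The kubelet startup flag --cadvisor-port sets the port on which cAdvisor serves, 4194 by default. It can be reached in a browser at <Node_IP:port>. Project home page: http://github.com/google/cadvisor.

2. cAdvisor architecture diagram

[Figure: cAdvisor architecture]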

3. Metrics

| Category | Field | Description |
| --- | --- | --- |
| cpu | cpu_usage_total | |
| | cpu_usage_system | |
| | cpu_usage_user | |
| | cpu_usage_per_cpu | |
| | load_average | Smoothed average of number of runnable threads x 1000 |
| memory | memory_usage | Memory usage |
| | memory_working_set | Working set size |
| network | rx_bytes | Cumulative count of bytes received |
| | rx_errors | Cumulative count of receive errors encountered |
| | tx_bytes | Cumulative count of bytes transmitted |
| | tx_errors | Cumulative count of transmit errors encountered |
| filesystem | fs_device | Filesystem device |
| | fs_limit | Filesystem limit |
| | fs_usage | Filesystem usage |

4. cAdvisor source code

4.1. cAdvisor entry function

cadvisor.go

func main() {
    defer glog.Flush()
    flag.Parse()
    if *versionFlag {
        fmt.Printf("cAdvisor version %s (%s)\n", version.Info["version"], version.Info["revision"])
        os.Exit(0)
    }
    setMaxProcs()
    memoryStorage, err := NewMemoryStorage()
    if err != nil {
        glog.Fatalf("Failed to initialize storage driver: %s", err)
    }
    sysFs, err := sysfs.NewRealSysFs()
    if err != nil {
        glog.Fatalf("Failed to create a system interface: %s", err)
    }
    collectorHttpClient := createCollectorHttpClient(*collectorCert, *collectorKey)
    containerManager, err := manager.New(memoryStorage, sysFs, *maxHousekeepingInterval, *allowDynamicHousekeeping, ignoreMetrics.MetricSet, &collectorHttpClient)
    if err != nil {
        glog.Fatalf("Failed to create a Container Manager: %s", err)
    }
    mux := http.NewServeMux()
    if *enableProfiling {
        mux.HandleFunc("/debug/pprof/", pprof.Index)
        mux.HandleFunc("/debug/pprof/cmdline", pprof.Cmdline)
        mux.HandleFunc("/debug/pprof/profile", pprof.Profile)
        mux.HandleFunc("/debug/pprof/symbol", pprof.Symbol)
    }
    // Register all HTTP handlers.
    err = cadvisorhttp.RegisterHandlers(mux, containerManager, *httpAuthFile, *httpAuthRealm, *httpDigestFile, *httpDigestRealm)
    if err != nil {
        glog.Fatalf("Failed to register HTTP handlers: %v", err)
    }
    cadvisorhttp.RegisterPrometheusHandler(mux, containerManager, *prometheusEndpoint, nil)
    // Start the manager.
    if err := containerManager.Start(); err != nil {
        glog.Fatalf("Failed to start container manager: %v", err)
    }
    // Install signal handler.
    installSignalHandler(containerManager)
    glog.Infof("Starting cAdvisor version: %s-%s on port %d", version.Info["version"], version.Info["revision"], *argPort)
    addr := fmt.Sprintf("%s:%d", *argIp, *argPort)
    glog.Fatal(http.ListenAndServe(addr, mux))
}

Core code:

memoryStorage, err := NewMemoryStorage()
sysFs, err := sysfs.NewRealSysFs()
// Create the containerManager
containerManager, err := manager.New(memoryStorage, sysFs, *maxHousekeepingInterval, *allowDynamicHousekeeping, ignoreMetrics.MetricSet, &collectorHttpClient)
// Start the containerManager
err = containerManager.Start()

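4.2. Using the cAdvisor client

import "github.com/google/cadvisor/client"

func main() {
    // URL format: http://<host-ip>:<port>/
    c, err := client.NewClient("http://192.168.19.30:4194/")
    if err != nil {
        panic(err)
    }
    _ = c // c exposes MachineInfo, ContainerInfo, and the other methods shown below
}

4.2.1. Client definition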

cadvisor/client/client.go

// Client represents the base URL for a cAdvisor client.
type Client struct {
    baseUrl string
}
// NewClient returns a new v1.3 client with the specified base URL.
func NewClient(url string) (*Client, error) {
    if !strings.HasSuffix(url, "/") {
        url += "/"
    }
    return &Client{
        baseUrl: fmt.Sprintf("%sapi/v1.3/", url),
    }, nil
}

4.2.2. Client methods

1) MachineInfo

// MachineInfo returns the JSON machine information for this client.
// A non-nil error result indicates a problem with obtaining
// the JSON machine information data.
func (self *Client) MachineInfo() (minfo *v1.MachineInfo, err error) {
       u := self.machineInfoUrl()
       ret := new(v1.MachineInfo)
       if err = self.httpGetJsonData(ret, nil, u, "machine info"); err != nil {
              return
       }
       minfo = ret
       return
}

2) ContainerInfo

// ContainerInfo returns the JSON container information for the specified
// container and request.
func (self *Client) ContainerInfo(name string, query *v1.ContainerInfoRequest) (cinfo *v1.ContainerInfo, err error) {
       u := self.containerInfoUrl(name)
       ret := new(v1.ContainerInfo)
       if err = self.httpGetJsonData(ret, query, u, fmt.Sprintf("container info for %q", name)); err != nil {
              return
       }
       cinfo = ret
       return
}

3) DockerContainer

// Returns the JSON container information for the specified
// Docker container and request.
func (self *Client) DockerContainer(name string, query *v1.ContainerInfoRequest) (cinfo v1.ContainerInfo, err error) {
       u := self.dockerInfoUrl(name)
       ret := make(map[string]v1.ContainerInfo)
       if err = self.httpGetJsonData(&ret, query, u, fmt.Sprintf("Docker container info for %q", name)); err != nil {
              return
       }
       if len(ret) != 1 {
              err = fmt.Errorf("expected to only receive 1 Docker container: %+v", ret)
              return
       }
       for _, cont := range ret {
              cinfo = cont
       }
       return
}

4) AllDockerContainers

// Returns the JSON container information for all Docker containers.
func (self *Client) AllDockerContainers(query *v1.ContainerInfoRequest) (cinfo []v1.ContainerInfo, err error) {
       u := self.dockerInfoUrl("/")
       ret := make(map[string]v1.ContainerInfo)
       if err = self.httpGetJsonData(&ret, query, u, "all Docker containers info"); err != nil {
              return
       }
       cinfo = make([]v1.ContainerInfo, 0, len(ret))
       for _, cont := range ret {
              cinfo = append(cinfo, cont)
       }
       return
}
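
The client methods above all follow the same pattern: build a URL under the `api/v1.3/` base and decode the JSON response. The sketch below imitates that pattern with only the standard library so it can run without the cAdvisor package. The `machine` path segment, the two JSON field names, and the helper names `apiBase` and `fetchMachineInfo` are assumptions made for illustration; the stand-in test server replaces a real cAdvisor.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// machineInfo keeps only two fields of the machine document.
// The JSON key names are assumptions for this sketch.
type machineInfo struct {
	NumCores       int    `json:"num_cores"`
	MemoryCapacity uint64 `json:"memory_capacity"`
}

// apiBase mirrors NewClient above: ensure a trailing slash, then append "api/v1.3/".
func apiBase(url string) string {
	if !strings.HasSuffix(url, "/") {
		url += "/"
	}
	return url + "api/v1.3/"
}

// fetchMachineInfo issues a GET to <base>machine and decodes the JSON body.
func fetchMachineInfo(base string) (*machineInfo, error) {
	resp, err := http.Get(apiBase(base) + "machine")
	if err != nil {
		return nil, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return nil, fmt.Errorf("unexpected status %s", resp.Status)
	}
	info := new(machineInfo)
	if err := json.NewDecoder(resp.Body).Decode(info); err != nil {
		return nil, err
	}
	return info, nil
}

func main() {
	// Stand-in server so the sketch runs without a real cAdvisor instance.
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.URL.Path != "/api/v1.3/machine" {
			http.NotFound(w, r)
			return
		}
		fmt.Fprint(w, `{"num_cores": 4, "memory_capacity": 8589934592}`)
	}))
	defer srv.Close()

	info, err := fetchMachineInfo(srv.URL)
	if err != nil {
		panic(err)
	}
	fmt.Println(info.NumCores, info.MemoryCapacity)
}
```

In real code the `github.com/google/cadvisor/client` package shown above is the better choice, since it returns the full `v1.MachineInfo` and `v1.ContainerInfo` types.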

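17.3 - Introduction to Heapster

1. Heapster overview

Heapster is a monitoring and performance analysis tool for container clusters, with native support for Kubernetes and CoreOS. Kubernetes has a well-known monitoring agent, cAdvisor. cAdvisor runs on every Kubernetes Node and collects monitoring data (cpu, memory, filesystem, network, uptime) for the machine and its containers.

2. Heapster deployment and configuration

2.1. Notes

The clocks of the deployment machine and the monitored machines must be synchronized: ntpdate time.windows.com

Add a cron job to synchronize the time periodically:

crontab -e

# Runs every day at 05:30
30 5 * * * /usr/sbin/ntpdate time.windows.com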

2.2. Container-based deployment

# Pull the image
docker pull heapster:latest
# Run the container
docker run -d -p 8082:8082 --net=host heapster:latest --source=kubernetes:http://<k8s-server-ip>:8080?inClusterConfig=false\&useServiceAccount=false --sink=influxdb:http://<influxdb-ip>:8086?db=<k8s_env_zone>

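2.3. Configuration

See the official documentation.

2.3.1. --source

--source: specifies the data source. Here it points to kube-apiserver. URL query parameters:

• inClusterConfig: use the kube config of the service account associated with Heapster's namespace (default: true)
• kubeletPort: the port used by the kubelet (default: 10255)
• kubeletHttps: whether to connect to the kubelets over https (default: false)
• apiVersion: the Kubernetes API version to use
• insecure: whether to trust the Kubernetes certificates (default: false)
• auth: client authorization configuration
• useServiceAccount: whether to use the Kubernetes service account token

2.3.2. --sink

--sink: specifies the backend data store. Here it points to the InfluxDB database. URL query parameters:

• user: InfluxDB user
• pw: InfluxDB password
• db: database name
• secure: connect to InfluxDB securely (default: false)
• withfields: use InfluxDB fields (default: false)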
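
The sink options are ordinary URL query parameters, so the flag can be assembled rather than typed by hand. This is a minimal sketch using Go's net/url; `sinkFlag` is a hypothetical helper, and the host and option values are placeholders.

```go
package main

import (
	"fmt"
	"net/url"
)

// sinkFlag builds a "--sink=influxdb:<url>" argument from a host:port and
// a set of query options such as user, pw, and db.
func sinkFlag(host string, opts map[string]string) string {
	q := url.Values{}
	for k, v := range opts {
		q.Set(k, v)
	}
	// Encode sorts the keys and escapes the values.
	u := url.URL{Scheme: "http", Host: host, RawQuery: q.Encode()}
	return "--sink=influxdb:" + u.String()
}

func main() {
	fmt.Println(sinkFlag("10.0.0.2:8086", map[string]string{
		"db":   "k8s",
		"user": "root",
	}))
	// Prints: --sink=influxdb:http://10.0.0.2:8086?db=k8s&user=root
}
```

When the result is pasted into a shell command, the `&` still has to be quoted or escaped, as in the docker run example in section 2.2.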

3. Metrics

| Category | Metric Name | Description | Notes |
| --- | --- | --- | --- |
| cpu | cpu/limit | CPU hard limit in millicores. | CPU limit |
| | cpu/node_capacity | CPU capacity of a node. | CPU capacity of the Node |
| | cpu/node_allocatable | CPU allocatable of a node. | Allocatable CPU of the Node |
| | cpu/node_reservation | Share of CPU that is reserved on the node allocatable. | |
| | cpu/node_utilization | CPU utilization as a share of node allocatable. | |
| | cpu/request | CPU request (the guaranteed amount of resources) in millicores. | |
| | cpu/usage | Cumulative CPU usage on all cores. | Total CPU usage |
| | cpu/usage_rate | CPU usage on all cores in millicores. | |
| filesystem | filesystem/usage | Total number of bytes consumed on a filesystem. | Filesystem usage |
| | filesystem/limit | The total size of the filesystem in bytes. | Filesystem size limit |
| | filesystem/available | The number of available bytes remaining in the filesystem. | Available filesystem capacity |
| | filesystem/inodes | The number of available inodes in the filesystem. | |
| | filesystem/inodes_free | The number of free inodes remaining in the filesystem. | |
| memory | memory/limit | Memory hard limit in bytes. | Memory limit |
| | memory/major_page_faults | Number of major page faults. | |
| | memory/major_page_faults_rate | Number of major page faults per second. | |
| | memory/node_capacity | Memory capacity of a node. | |
| | memory/node_allocatable | Memory allocatable of a node. | |
| | memory/node_reservation | Share of memory that is reserved on the node allocatable. | |
| | memory/node_utilization | Memory utilization as a share of memory allocatable. | |
| | memory/page_faults | Number of page faults. | |
| | memory/page_faults_rate | Number of page faults per second. | |
| | memory/request | Memory request (the guaranteed amount of resources) in bytes. | |
| | memory/usage | Total memory usage. | |
| | memory/cache | Cache memory usage. | |
| | memory/rss | RSS memory usage. | |
| | memory/working_set | Total working set usage. Working set is the memory being used and not easily dropped by the kernel. | |
| network | network/rx | Cumulative number of bytes received over the network. | |
| | network/rx_errors | Cumulative number of errors while receiving over the network. | |
| | network/rx_errors_rate | Number of errors while receiving over the network per second. | |
| | network/rx_rate | Number of bytes received over the network per second. | |
| | network/tx | Cumulative number of bytes sent over the network. | |
| | network/tx_errors | Cumulative number of errors while sending over the network. | |
| | network/tx_errors_rate | Number of errors while sending over the network per second. | |
| | network/tx_rate | Number of bytes sent over the network per second. | |
| uptime | uptime | Number of milliseconds since the container was started. | - |
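
Several metrics in the table come in pairs: a cumulative counter (for example network/rx) and a per-second rate (network/rx_rate). The relationship between the two can be sketched as a difference quotient over two samples. The `sample` type and `perSecond` function are hypothetical names for illustration and do not reproduce Heapster's implementation.

```go
package main

import "fmt"

// sample is one reading of a cumulative counter such as network/rx.
type sample struct {
	unixSec int64  // time of the reading, in seconds since the epoch
	value   uint64 // counter value at that time
}

// perSecond returns the average rate between two samples.
// It reports false when time did not advance or the counter went backwards
// (for example after a container restart), since no rate can be derived then.
func perSecond(prev, cur sample) (float64, bool) {
	dt := cur.unixSec - prev.unixSec
	if dt <= 0 || cur.value < prev.value {
		return 0, false
	}
	return float64(cur.value-prev.value) / float64(dt), true
}

func main() {
	prev := sample{unixSec: 100, value: 1000}
	cur := sample{unixSec: 110, value: 6000}
	rate, ok := perSecond(prev, cur)
	fmt.Println(rate, ok) // 5000 bytes over 10 seconds: prints "500 true"
}
```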

4. Labels

| Label Name | Description |
| --- | --- |
| pod_id | Unique ID of a Pod |
| pod_name | User-provided name of a Pod |
| pod_namespace | The namespace of a Pod |
| container_base_image | Base image for the container |
| container_name | User-provided name of the container or full cgroup name for system containers |
| host_id | Cloud-provider specified or user specified identifier of a node |
| hostname | Hostname where the container ran |
| labels | Comma-separated (default) list of user-provided labels. Format is 'key:value' |
| namespace_id | UID of the namespace of a Pod |
| resource_id | A unique identifier used to differentiate multiple metrics of the same type, e.g. Fs partitions under filesystem/usage |
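
The `labels` value is described as a comma-separated list in 'key:value' format. A minimal sketch of turning such a string back into a map follows; `parseLabels` is a hypothetical helper, and it assumes the default comma separator and keys that contain no colon.

```go
package main

import (
	"fmt"
	"strings"
)

// parseLabels converts "k1:v1,k2:v2" into a map.
// Entries without a colon or with an empty key are skipped.
func parseLabels(s string) map[string]string {
	out := map[string]string{}
	for _, pair := range strings.Split(s, ",") {
		kv := strings.SplitN(pair, ":", 2)
		if len(kv) != 2 || kv[0] == "" {
			continue
		}
		out[kv[0]] = kv[1]
	}
	return out
}

func main() {
	labels := parseLabels("app:web,tier:frontend")
	fmt.Println(labels["app"], labels["tier"]) // web frontend
}
```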

5. Heapster API

See the official documentation: https://github.com/kubernetes/heapster/blob/master/docs/model.md

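17.4 - Introduction to InfluxDB

1. InfluxDB overview

InfluxDB is a popular time-series database. It is written in Go, has no external dependencies, is easy to install and configure, and is well suited to building monitoring for large distributed systems.

Main features:

1) Time-series based: supports time-related functions (such as max, min, and sum)

2) Metrics-oriented: large volumes of data can be computed on in real time

3) Event-based: supports arbitrary event data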

2. Installing InfluxDB

1) Install

wget https://dl.influxdata.com/influxdb/releases/influxdb-0.13.0.x86_64.rpm

yum localinstall influxdb-0.13.0.x86_64.rpm

2) Start

service influxdb start

3) Access

http://<server-IP>:8083

4) Install from the docker image

docker pull influxdb

docker run -d -p 8083:8083 -p 8086:8086 --expose 8090 --expose 8099 --volume=/opt/data/influxdb:/data --name influxsrv influxdb:latest

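3. Basic InfluxDB concepts

3.1. Comparison with traditional database terms

| InfluxDB term | Traditional database concept |
| --- | --- |
| database | Database |
| measurement | Table in a database |
| points | A row of data in a table |

3.2. Concepts unique to InfluxDB

3.2.1. Point

A Point consists of a timestamp (time), data (fields), and tags.

A Point is comparable to a row in a traditional database, as shown below:

| Point attribute | Meaning |
| --- | --- |
| time | The time of each record; the primary index in the database (generated automatically) |
| fields | The recorded values (attributes without an index), for example temperature and humidity |
| tags | Indexed attributes, for example region and altitude |

3.2.2. series

Data in the database is ultimately displayed in charts. A series is the data of a measurement that can be drawn as one line on a chart, and the set of series is derived from the combinations of tag values.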

show series from cpu
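
To make "derived from the combinations of tag values" concrete, the sketch below counts the distinct measurement and tag-set combinations in a list of points. The `point` type and both function names are hypothetical and only model the idea; they are not InfluxDB's internal representation.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// point models only what matters for series identity: measurement and tags.
type point struct {
	measurement string
	tags        map[string]string
}

// seriesKey joins the measurement with its tags in sorted key order, so that
// the same tag set always produces the same key.
func seriesKey(p point) string {
	keys := make([]string, 0, len(p.tags))
	for k := range p.tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	b.WriteString(p.measurement)
	for _, k := range keys {
		b.WriteString("," + k + "=" + p.tags[k])
	}
	return b.String()
}

// countSeries returns the number of distinct series among the points.
func countSeries(points []point) int {
	seen := map[string]struct{}{}
	for _, p := range points {
		seen[seriesKey(p)] = struct{}{}
	}
	return len(seen)
}

func main() {
	points := []point{
		{"cpu", map[string]string{"host": "a", "region": "x"}},
		{"cpu", map[string]string{"host": "b", "region": "x"}},
		{"cpu", map[string]string{"region": "x", "host": "a"}}, // same series as the first
	}
	fmt.Println(countSeries(points)) // 2
}
```

Field values and timestamps play no part in the key, which is why adding more rows for the same host does not add lines to the chart.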

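4. Basic InfluxDB operations

InfluxDB can be operated in three ways:

1) The command-line client

2) The HTTP API

3) API libraries for various languages

4.1. Database operations

| Operation | Command |
| --- | --- |
| List databases | show databases |
| Create a database | create database db_name |
| Drop a database | drop database db_name |
| Select a database | use db_name |

4.2. Measurement (table) operations

| Operation | Command | Notes |
| --- | --- | --- |
| List all measurements | SHOW MEASUREMENTS | |
| Create a measurement | insert table_name,hostname=server01 value=442221834240i 1435362189575692182 | table_name is the measurement name, hostname is a tag (indexed), value=xx is the recorded field value (there can be several), and the last item is the timestamp |
| Drop a measurement | drop measurement table_name | |
| View measurement contents | select * from table_name | |
| View series | show series from table_name | A series is the data of a measurement that can be drawn as one line on a chart; the series are derived from the combinations of tag values |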
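
The insert command in the table follows the layout `measurement,tag=value field=value timestamp`, with an `i` suffix marking an integer field. A minimal sketch of producing such a line is shown below. `lineProtocol` is a hypothetical helper that handles a single integer field and does no escaping of spaces or commas, so it is not a full implementation of the format.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// lineProtocol formats one point with a single integer field.
// Tags are written in sorted key order; values are assumed to need no escaping.
func lineProtocol(measurement string, tags map[string]string, field string, value, tsNanos int64) string {
	keys := make([]string, 0, len(tags))
	for k := range tags {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	var b strings.Builder
	b.WriteString(measurement)
	for _, k := range keys {
		fmt.Fprintf(&b, ",%s=%s", k, tags[k])
	}
	// The trailing "i" marks the field value as an integer.
	fmt.Fprintf(&b, " %s=%di %d", field, value, tsNanos)
	return b.String()
}

func main() {
	line := lineProtocol("disk_free", map[string]string{"hostname": "server01"},
		"value", 442221834240, 1435362189575692182)
	fmt.Println(line)
	// Prints: disk_free,hostname=server01 value=442221834240i 1435362189575692182
}
```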